What Data Scientists Really Do, According to Data Scientists (hbr.org)
278 points by pseudolus 6 months ago | 136 comments

"The only difference between screwing around and science is writing it down" - Adam Savage.

In all seriousness, why can't data science simply be about applying the scientific method in the realm of data analysis? It doesn't need to be conflated with machine learning, BI, SQL, etc. It can just be about approaching data analysis with scientific rigor.

My opinion is that the term data science evolved when we started needing cross-functional people who are a blend of:

- domain experts;

- numerical/quantitative specialists (such as statisticians, mathematicians, physicists, STEM people);

- business analysts, business intelligence; and,

- those who traditionally deal with data management, platforms and tools.

That confluence of people was needed amidst the related trends:

- increased government funding for STEM education and brain research;

- marketing from companies such as IBM ("Watson"), the democratization of data and increase in the use of data in daily life;

- the big data wave, subsequent interest in "internet of things" and "digital transformation";

- renewed interest in machine learning and AI (recurrent neural networks and other breakthroughs);

- and others, of course.

We needed to apply more discipline to data analysis - thus data science was born. A formalizing of what many were already doing, to capture the need and changing paradigm. Or so I like to believe.

> why can't data science simply be about applying the scientific method in the realm of data analysis?

Because then someone with a business school education (and zero formal statistical training) wouldn't be able to do it. I joke about waiting for finance's Excel models to be rebranded as AI, as I've already seen a handful of hedge funds rebrand their analysts as data scientists.

A large number of hedge fund analysts are data scientists. Just because computational finance models used a tool that abstracted away a large portion of the programming doesn't mean that they weren't using applied statistics and domain specific modeling to solve problems.

> a large number of hedge fund analysts are data scientists

Maybe I'm being a curmudgeon. In my book if you can't build the statistical tool you're using, you don't understand it. So if, in Excel, you can fit a regression from "scratch", (i.e. not using any built-in regression functions) and use the built-in functions for convenience, that's fine. If you can't, you're a regular financial analyst. (Nothing wrong with that. I was one once.)

This is important because being able to build it means being able to tweak it. Excel's tools have quirks and make built-in assumptions about your data. If those assumptions don't hold, you should be able to tweak (or change) your approach. Being limited to built-in models removes that flexibility. It also implies you don't know when you're crossing between "my tool works" and "my tool is outputting garbage."

> not using any built-in regression functions

This is absolute pure silliness. So the guy down the street that can't manufacture a new carburetor isn't a mechanic? The doctor that can't create penicillin in his personal office isn't a medical professional?

More often than not, the people that create new systems are not the same people that take these concepts and apply them to practical business problems. It takes many kinds of people to help a company grow.

Hadley Wickham also stressed the importance of readability and reproducibility of code versus software like Excel. A sequence of manipulations and button presses doesn't leave a clear history that lets others on your team understand your work, quickly test and reproduce it to validate it, or apply it to new data sets.

For work to be "science", it really ought to be transparent and reproducible within your community (eg, team within your firm).

This is an important point too. I'm still amused by the story of the "Growth in a Time of Debt" paper. Two famous Harvard economists made an Excel mistake (among other things) that completely changed the conclusion of their study and ultimately affected real-world economic policies.

Here is a summary of the story: https://www.nytimes.com/2013/04/19/opinion/krugman-the-excel...

>So if, in Excel, you can fit a regression from "scratch", (i.e. not using any built-in regression functions) and use the built-in functions for convenience, that's fine.

I suppose it depends on what you mean by "can". Much like whiteboard interviewing, I bet I can't sit down today and implement a regression from scratch, because I'm out of practice. I have had no need to retain that information; there are tools that do it for me.

Could I do it quickly with a stats 101 textbook in my lap (or 10 minutes of Googling)? Absolutely.

I don’t think most “data scientists” can do b = (X′X)^(−1) X′ Y right off the top of their heads either.
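For anyone who wants to see what b = (X′X)^(−1) X′ Y amounts to in practice, here is a minimal sketch in plain Python with no libraries. It solves the normal equations (X′X) b = X′Y by Gaussian elimination instead of forming the inverse explicitly, which gives the same b. The helper names (`transpose`, `matmul`, `solve`, `ols`) and the toy data are made up for illustration.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(xs, ys):
    """Return [intercept, slope] from the normal equations (X'X) b = X'Y."""
    X = [[1.0, x] for x in xs]          # column of ones gives the intercept
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    XtY = [sum(x * y for x, y in zip(row, ys)) for row in Xt]
    return solve(XtX, XtY)


# Toy data, invented for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = ols(xs, ys)
print(f"intercept={intercept:.2f} slope={slope:.2f}")  # intercept=1.09 slope=1.97
```

The same function works for more predictors if you add columns to `X`; only the row construction in `ols` would need to change.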

Let's take a simple example: linear regression. You can fit a linear model with a few button clicks in Excel and then look at R² to check if it's high enough and call it a day. Whereas in R you'd be able to easily check a bunch of other metrics to assess your model (ANOVA, plot the residuals, etc). When there's a lot at stake, you want to make the decision yourself and not just rely on the output of a black box.

All of the things you're talking about are possible to do with Excel. Of course I prefer R, but at some level we all abstract away algorithmic responsibility and return granularity.
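To make the "high R² and call it a day" trap concrete, here is a small sketch in plain Python. It fits a straight line to data that is actually quadratic (a deliberately contrived toy case), then looks at both R² and the sign pattern of the residuals. The function names are made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(ys, fitted):
    """Share of the variance in ys explained by the fitted values."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Toy data: the true relationship is y = x^2, not a line.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]
r2 = r_squared(ys, fitted)
signs = "".join("+" if r > 0 else "-" for r in residuals)

print(f"R^2 = {r2:.3f}")           # R^2 = 0.950
print("residual signs:", signs)    # residual signs: ++------++
```

R² looks respectable, but the residuals are positive at both ends and negative in the middle, which is the kind of systematic pattern a residual plot would reveal and a single summary number hides.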

You can also easily obtain summary statistics and ANOVAs, etc, in Excel with a few clicks.

> You can also easily obtain summary statistics and ANOVAs, etc, in Excel with a few clicks

To be clear, my original comment was not criticizing Excel. It pokes fun at analysts who couldn't tell you a Type I error from a Type II being branded as "data scientists" because they build DCF models.

I think a relevant analogy from back in the day is being able to use FrontPage and then calling yourself a web developer.

Certainly it has good features to create what you want, and you could edit source (if you can read it!)...but it's used by people who have an elementary knowledge of web development.

> why can't data science simply be about applying the scientific method in the realm of data analysis?

That's what a statistician does.

I've seen these ML and data science people. Most of the time, how they tackle data is radically different from a statistician's approach, and more of an art than a science by comparison.

But this could be my biased opinion, based on a small sample of personal experiences.

Actually, on the last day of my internship I met a few statistician interns, some of them from Cal (UC Berkeley), and they came to the same conclusion (we have a lot of complaints). The ML/DS group is really just doing black magic (nicest way of putting it). I wish statistics were better at marketing. Oh well.

> > why can't data science simply be about applying the scientific method in the realm of data analysis?

> That's what a statistician does.

Run that experiment for me next time you meet a statistician:

    - ask him if he can apply Chi-squared to a decision problem

    - ask him if he can *explain* how and why Chi-squared works.

In my experience, all statisticians can do the first, almost none can do the second.

Learning how to use a screwdriver to screw screws without understanding notions of torque and moment doesn't mean you're applying the scientific method.
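As a rough illustration of the "how and why" part, here is a sketch of Pearson's Chi-squared test of independence computed by hand in plain Python. The idea is to compare each observed count with the count you would expect if rows and columns were independent, and sum the squared gaps scaled by the expected counts. The table is invented, the function names are made up, no continuity correction is applied, and the p-value shortcut only holds for one degree of freedom (a 2x2 table).

```python
import math


def chi_squared_statistic(table):
    """Pearson statistic: sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


def p_value_one_df(statistic):
    """Upper tail of a Chi-squared variable with 1 degree of freedom.

    With 1 df the statistic is a squared standard normal, so
    P(Z^2 > s) = erfc(sqrt(s / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


# Invented 2x2 table: rows are two groups, columns are two outcomes.
table = [[30, 10],
         [20, 40]]
stat = chi_squared_statistic(table)
p = p_value_one_df(stat)
print(f"chi-squared = {stat:.3f}, p = {p:.2g}")
```

For larger tables the statistic is computed the same way, but the p-value needs the Chi-squared distribution with (rows - 1) * (columns - 1) degrees of freedom, which this sketch does not implement.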

I would argue that they should be able to understand it to the level that they can at least defend Chi-squared as a tool for the problem at hand. Then, they should be able to evaluate whether or not it works correctly.

If a medical researcher is testing a new radio-therapy treatment, but can't mathematically model every fission problem you can throw at them, they're still applying the scientific method.

> In my experience, all statisticians can do the first, almost none can do the second.

I think a PhD statistician can do this.

A master's in statistics does not touch on field and measure theory in statistical inference, so many questions go unanswered. But I suspect you may be correct.

In most of what I have encountered and learned, I have viewed chi-squared as a statistical distance.

In my humble experience as a data scientist at a big tech company, a big differentiating factor is familiarity with the R or Pydata stack. It's not just its own language and library, it's a more general idiomatic way toward approaching and solving problems from a "software first" perspective.

I sometimes work with people who are formally trained in stats at a much higher level than myself. But with my lesser ability combined with my capability to write production grade Python code using the pydata stack, my code and solutions are more likely to make it in.

I think we're talking about different roles.

I'm coming at it from a pure modeler role, and it seems like you are describing more of a full-stack role within data science. I took the OP's remark to be about the data analyst side only, in terms of analysing the data, not turning it into some kind of application.

As for Python, time series isn't as good in Python. I do agree that Python is good for production. But I don't believe Python has better analysis packages. I'm in the camp of using programming languages for their strengths and using Apache Thrift or whatever to tie everything together.

> As for Python, time series isn't as good in python

Time series isn't as good

Linear regression isn't as good

Mixed models are awful

Good luck fitting a spline without diving into scipy and doing it as an interpolation

R packages are almost always accompanied with a rigorous paper and a great vignette, whereas in Python, there are almost no talks of the implementation and just documentation about how you can use the library.

Every model we develop has assumptions and shortcomings. R offers the tools to diagnose and examine them, whereas in Python, I get the feeling that it's more like "some smart guys figured out this formula, here's an implementation of it".

I agree. I didn't want to come off as a disgruntled statistician and list the weaknesses of Python.

But you're right. I find most Springer, CRC, etc. books are in R with accompanying R packages. Shumway & Stoffer's time series book comes with the R astsa package. Andrew Gelman's Bayesian group created Stan, and their first target was R (rstan). The Stanford people who created lasso and elastic net created glmnet. It is reassuring that the creators of or experts in these subjects wrote a book or research paper and then also published an R package to accompany it.

When it comes to just data analysis, it seems like R is pretty good, or better than Python. I think Python is much better for ML stuff such as deep learning, and it also seems like Python has better NLP support. Arguably Stanford is very good at NLP, and they publish their tool in Java.

if you really understand ML/DS, then you would know that ML is founded on all the same concepts.

> "... they tackle data is radically different from statistician ..."

the short answer is that before, we had limited data and compute, and the problems businesses need solved have since grown substantially.

now we can do better things, and if you can't do more than what companies were doing in the 1950s, then your value proposition is significantly less than that of someone who's using industry practices established in 2017 / early 2018.

> "I wish statistic is better at marketing. Oh well."

as a data scientist, I have to know stats, ML, big data tools (spark/splunk/hdfs), programming, and domain specific knowledge. some data scientists are just rebranded statisticians, which is fine. the role of a data scientist varies to the needs of the org they are in. I mostly do anomaly detection and classification, while others may focus solely on AB testing.

> > "> why can't data science simply be about applying the scientific method in the realm of data analysis?"

the statement doesn't make sense to me. if you are doing data analysis then by default you are doing observations and measurement on empirical data. As far as experimentation goes, almost no one has the resources or the standing to do that. Some do, like those AB testers I mentioned before.

> "Actually my last day of internship I've met a few statistician interns some of them are from Cal (UC Berkeley) and they came to the same conclusion (we have a lot of complaints). The ML/DS group is really just doing black magic (nicest way of putting it)"

Then they don't appreciate ML and the problems it is trying to tackle. It reminds me of a stats PhD co-worker who was upset about the number of parameters in an image classification problem we had. They didn't think a model could be run that had more parameters than observations. I was like, we are not running linear models ... As an experiment, go to Kaggle and try the spooky author classification competition. See how far you can get with basic stats and then see how much further you can get with ML. Hopefully it will give you an appreciation for some of the ML tools available.

It's not black magic. Like many other domains, it's just a dense subject that takes a lot of time to understand.

Yeah, most of this thread has kinda garbage responses.

Even if you did a CS degree and then did a BMath with a statistics major you still wouldn't have all the skills you would need, though you'd have a great place to start off from. There is an art in kinda guiding / interpreting the mathematics that's hard to teach and most people in the field kinda pick it up through experimentation. I'm sure it will get more formalized at some point, but I don't think the formalization will be all mathematical or scientific; I think much of it will be making explicit how to think about certain domains, or systems of processes that are useful.

Also I think there is a big difference between Data Engineers, Data Analysts, Data Scientists, and AI researchers.

Depending on the field, an AI researcher deals more with idealized forms and is much stronger on the mathematics of machine learning.

Data engineers are usually stronger on the CS topics that deal with scale.

Data analysts, even the very best ones, tend to be weaker on the CS, the math, but tend to be the strongest at quickly understanding data and making it intelligible to decision makers. I'm not knocking it—I was an analyst at one point—it's just the truth that some positions take more skill than others and most great data analysts tend to treat it as a 3 or 5 year stop before levelling up to management or a more technical role.

Data Scientists generally have the skills of a data analyst (though they have trouble dumbing things down at times if they came here from something other than an analyst position) with some of the skills from the other two.

The way some application code software developers dismiss a whole sub-discipline is kinda embarrassing.

Is business intelligence considered data science? I've never really understood. Thanks.

In social sciences (speaking as an economist), the goal of statistics was always testing theories and hypotheses against empirical evidence. Once you have a solid theory, then you have an understanding of the subject matter, which allows you to make predictions and counterfactual hypotheses. So the primary goal was understanding the data generating process, finding causal relationships between variables, and so on. The goal was specifically not making predictions. One of the first things I learnt was that a bad model can make good predictions (for a while). The ML crowd appears to be taking the opposite approach, focusing on making predictions and disregarding everything else. I suppose it has its uses, but I wouldn't call it science.

This, and your other responses in this thread, come off as rather disappointing to me, as someone who considers the work they do as "data science". Machine learning as a method clearly has quite a bit to contribute to the business world based on revenue alone. My argument could rest here.

But it's also disappointing the way that you belittle all machine learning practitioners, even those with academic credentials, for their work not being worthy of serious consideration. This also sounds a little defensive and projective, and I can't imagine it's easy seeing the forest for the trees with your head in the clouds.

Walking a tightrope is a pretty hard balancing act.

I do want to give a personal view and constructive criticism, but a deliberate, direct attack on another domain is not my intention.

I think statistics is very well equipped to do just data analysis. The discipline has many weaknesses, and I acknowledge that, but I don't believe data analysis is one of them, since it is the core tenet of what statistics is.

I believe data science is too new and is still trying to find its footing. It also seems like a jack-of-all-trades, master-of-none discipline. I don't believe the discipline can be a master of everything, including data analysis, with all the other things it's trying to incorporate.

My critique of the original post is that there is already a field for data analysis, dedicated to it for a century now. It's named statistics.

ML/DS is a new breed that is more than just data analysis. Because of this its practitioners can do many things, but I don't believe they are experts at any one thing. Which is fine. But I do understand that this discussion is sensitive for people within DS/ML, since that is how they make money and earn a living.

How good at statistics, and at math in general, were the ML and data science people you have met?

The amazing thing is that the black magic gets results (for certain categories of problems).

I've seen them do random forest on temporal data.

They tried to fix it with PCA.

What do you mean by "on temporal data"? That no feature extraction was performed, for example? This sounds pretty amateurish, and completely below the level of understanding in the many machine learning papers that I've read - which is admittedly a small percentage of what has been published.
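For readers wondering what "feature extraction" on temporal data might look like, one common option is to turn the series into lagged features and to split train/test by time rather than at random. This is only a guess at what was missing in the anecdote above, sketched in plain Python; the function names and the series are invented.

```python
def make_lag_features(series, n_lags):
    """Turn a series into (features, target) rows.

    Each row uses the previous n_lags values to predict the next value,
    so a tabular model never sees the future of the row it is predicting.
    """
    rows = []
    for t in range(n_lags, len(series)):
        rows.append((series[t - n_lags:t], series[t]))
    return rows


def time_ordered_split(rows, n_train):
    """Keep the earliest rows for training and the latest for testing.

    A random shuffle would leak future observations into the training set.
    """
    return rows[:n_train], rows[n_train:]


# Invented series for illustration.
series = [10, 12, 13, 15, 18, 21, 25, 30]
rows = make_lag_features(series, n_lags=3)
train, test = time_ordered_split(rows, n_train=4)

print(rows[0])   # ([10, 12, 13], 15)
print(test)      # [([18, 21, 25], 30)]
```

Rows built this way can be fed to any tabular model, a random forest included; whether that is a good idea for a given series is a separate question.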

I've seen "them" test in production. Every field has varying levels of skill.

I've written about this problem before as well[1] - data science seems to be defined uniquely narrowly (and confusingly) compared to other fields.


"Data science" is the most natural name for this field. Though fields like "information science" and "political science" are broad, "data science," as it is popularly defined, is uniquely narrow. This is problematic because general fields typically serve as a roadmap for all subfields - a cursory glance at what they are and how they relate. "Data science" today does not provide this road map.

[1] https://alexpetralia.com/posts/2016/6/22/reclaiming-the-term...

Seems like "Data Scientist" as a job title is too generic.

It's like saying "Web Designer" nowadays, when in reality we have a variety of job specializations like UX strategist (ux only), graphics designer (photoshop), UI developer (html/css only), front-end developer (html/css/js), etc.

> In all seriousness, why can't data science simply be about applying the scientific method in the realm of data analysis?

I suppose someone who isn't very smart is staffing data analysts with little understanding of science. Most would assume that if you are getting paid to analyze all this valuable data, you would have some grounding in the scientific method.

Tangential: I loved the appearance of the term, "Data Scientist". Scientists are domain experts. A biologist knows cell membranes. A chemist knows valences. For decades statisticians had been pitching that they were helpful without knowing the domain: appear-deliver-and-run consultants. It was wonderful to see a group embrace the importance of knowing the system that is producing the data. I regret the more recent movement of "data science" consultants who try to run models and AI without understanding the systems the data came from.

The models and AI have reached self-licking-ice-cream-cone status. All of the work in that area is focused on analyzing the models rather than what is being modeled.

This, totally: we don't understand data, or databases. We've only really had them for 20 years. We don't understand the lifecycle, or how value is accumulated or destroyed, and we don't understand the composite behaviour or the dynamics with the users.

There is a crossover: data science is often about constructing data resources from other data resources (and then doing advanced analysis like "count how many x"). Doing this rigorously and efficiently, with regard to the underlying infrastructure and other users (don't kill production), is a big trick.

I think databases and analysis have been around a little longer than 20 years...

Not in their current form: we got relational systems in 1970, but data warehousing came in in the mid-90s. The use of data for multiple purposes (not just to underpin one grand application) is relatively new - as is the practice of using data for a purpose for which it was not gathered... remember all the "all the statistician can do is pronounce what has gone wrong" stuff?

This is fairly accurate in my experience: https://twitter.com/thesmartjokes/status/684286479401652224

Day to day, it's mostly SQL, or worse, Hive queries, which make most things much slower than they should be.

One thing rarely discussed with the rise of big data is how to do efficient querying, especially at scale.

I've had a ton of data science interviews which ask how to reimplement binary search from scratch (which I would never do on the job), but not anything about how to do efficient JOINs and query nesting.

Exactly. Optimising queries becomes a really critical part of the job when you make complex JOINs on millions of records. Just getting the data can take a huge amount of time before you can even consider models.

I’ve been told by IT from numerous organisations that Hadoop will solve all of our team’s query inefficiencies.

That's also why we have new members of the team learn how to do efficient queries and joins, and spend time upfront structuring their problems.
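A small, hedged example of what "efficient JOINs" can mean in practice: indexing the join column and checking the query plan before and after. This sketch uses Python's built-in `sqlite3` with an in-memory database; the table names, column names, and data are invented, and other engines (Hive included) plan queries very differently, so treat it as an illustration of the habit rather than a tuning guide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(1, 1001)],
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [((i % 1000) + 1, float(i)) for i in range(20000)],
)

QUERY = """
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    WHERE c.id = 42
    GROUP BY c.name
"""


def query_plan(connection, sql):
    """Return SQLite's plan description for a query as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


plan_before = query_plan(conn, QUERY)
result_before = conn.execute(QUERY).fetchall()

# Index the join column so matching orders can be looked up directly.
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")

plan_after = query_plan(conn, QUERY)
result_after = conn.execute(QUERY).fetchall()

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan text depends on the SQLite version, so no output is shown here; the thing to look for is whether the `orders` step mentions `idx_orders_customer_id` after the index is created. The query result itself is the same either way.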

I work as a biostatistician and I've been tasked recently with querying large databases using SQL, in addition to analysing the data. However, my programming background is very limited and thus I'm sure my queries are very inefficient.

Could you point me to some materials/texts about how to improve querying efficiency for SQL? If it's oriented for beginners then that would be ideal.

Thanks in advance.

This site is great.


Thank you! I've already started reading it and it seems to be just what I needed.

Dumb question: is Splunk implemented with Hive?

No. Splunk is its own proprietary thing. They actually work somewhat similarly (in that they're both query engines over distributed data using a map-reduce paradigm) although a closer parallel is Lucene/Elasticsearch.

I recently found out that my company’s information security and IT ops team have licensed Splunk. They use it to analyze machine logs, firewall logs, etc.

My team focuses on traditional reporting (data in Oracle, manipulated with SAS, sometimes statistical modeling with SAS, resulting in reports/dashboards presented with QlikView).

The IT team approached me to ask if we wanted to explore any use cases of Splunk for my team. I’ve looked around a bit and don’t understand this stuff enough to know if there are any. We generally are working with health insurance claims data, and have no problems being able to process the data we need using Oracle/SAS.

Any ideas spring to mind that might make me want to invest time in looking at this?

“It has been a common trope that 80% of a data scientist’s valuable time is spent simply finding, cleaning, and organizing data, leaving only 20% to actually perform analysis.”

Is it really a trope? In my experience, I almost think collecting data is >80%.

80% of time is spent on getting and cleaning the data, 20% of time is spent preparing and delivering reports and presentations.

The super-cool ML stuff that attracts people to the field in the first place accounts for little more than a rounding error in how the time is really spent.

I get to spend 90% collecting, cleaning and tagging data.

Sounds like poor role definition? You don't have your laboratory scientist cleaning lab equipment or ordering reactants.

https://dremio.com tries to solve this exact problem. It’s open source too.

so I checked out the website and I still don't get what Dremio actually is...

I know what I need which is a wide open easy to use (think CEO's personal assistant who has zero background in anything other than basic business processes easy) visual ETL tool.

We have found some decent VERY expensive and VERY user unfriendly tools but nothing cheaper than just hiring a person to custom handle each batch.

Basically...we're stuck with Cognos, et al.

- years (continuous endeavor): find existing data sources

- months: convince management to give access to data source

- weeks: try to find the connection string

- days: clean up the data (mostly converting dates to yyyy-mm-dd) and importing/exporting csv files

- hours: load data in database, write simple SQL query and simple visualisation

- seconds: brief moment of satisfaction
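Since "converting dates to yyyy-mm-dd" comes up in the list above, here is a minimal sketch of that chore using only Python's standard library. The list of accepted input formats is an assumption made for this example (including the choice to read `05/03/2019` as day-first); real data would need its own list, and anything that does not match is returned as `None` rather than guessed at.

```python
from datetime import datetime

# Assumed input formats, tried in order. Day-first is an assumption:
# "05/03/2019" is read as 5 March, not May 3.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y", "%Y%m%d")


def to_iso_date(raw):
    """Return the date as yyyy-mm-dd, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


samples = ["2019-03-05", " 05/03/2019 ", "5.3.2019", "20190305", "not a date"]
cleaned = [to_iso_date(s) for s in samples]
print(cleaned)
# ['2019-03-05', '2019-03-05', '2019-03-05', '2019-03-05', None]
```

Returning `None` keeps the unparseable rows visible, so they can be counted and inspected instead of silently becoming wrong dates.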

I just started reading Weapons of Math Destruction by mathematician Cathy O'Neil. She warns about big data systems that codify racism and classism from flawed data and self-fulfilling feedback loops. The systems' "unbiased" decisions are opaque, proprietary, and often unchallengeable.


If "big data" informs you that a group of people (race, geographic location, nationality, occupation, gender, etc) is less likely to say, pay back a loan, or successfully complete 4 years at university, or avoid insurance claims... must that be anything-ist? Or just a fact? Or, as your post suggests, are you obligated to throw out the conclusion and just assume the inputs must have been "flawed data"?

I think, perhaps, you misunderstood the OP's objection, particularly with respect to feedback loops.

It may not be safe to assume the inputs are biased [1], but, then, it's also not safe merely to assume that they aren't.

What's particularly dangerous is to assume unbiased inputs if the subsequent conclusions (the outputs) then influence future inputs [2]. That's the feedback loop that can serve to amplify biases in the data/conclusions with each cycle.

> must that be anything-ist? Or just a fact?

To answer your question, no, it doesn't have to be anything-ist and is, of course, a fact. However, to characterize it as just a fact ignores that it may well be an anything-ist fact, if it had been based on anything-ist data.

Ultimately, this is just a GIGO [3] problem, a concept from early in "computing" (aka "data processing").

[1] which I'm substituting for "flawed", since "unbiased" was the OP's term

[2] for example, by setting interest rates which have an influence on loan repayment likelihood

[3] https://en.wikipedia.org/wiki/Garbage_in,_garbage_out

Big data may conclude that the data shows black people are more likely to commit crimes, but it doesn’t tell you the why, which may be that black people are systematically oppressed, more likely to come from a poorer background, and discriminated against so as to reduce their options in life. The AI, based on statistics, can tell you the effects, but not the causes.

So if I were to go and beat up every person named John, and then machine learning takes a crack at it and tells us that people named John are more likely to get injured, we may end up discriminating against Johns without realizing it was I who just have a thing against people named John. If this happens in a system and creates a feedback loop, it can lead to something becoming a self-fulfilling prophecy when it need not have been.

Based on a naive conclusion the solution may be to deny health coverage to people named John, but obviously the real solution is to put me in jail.

Regardless of the data used, not knowing the why does not invalidate a strong correlation at all. If you only acted once you knew the why, there would be no progress in science, because there are always new whys to uncover as you go.

That's why, when studies discover new correlations, the findings have to be replicated in different study designs that actually allow you to say something about causation.

In medicine, that would often mean a randomized study between treatment A and a placebo.

However, with factors such as race, or sex this is obviously impossible.

Instead of looking at race or sex, looking at individual behaviors usually explains things better but it is harder to capture.

> which may be that black people are systematically oppressed, more likely to come from a poorer background, discriminated against so as to reduce their options in life

Also known as a confounding effect. All of these problems were encountered by statisticians decades ago.

Correcting for confounding is probably the hardest thing to do when you are building statistical models. There is no true consensus with regard to how you should go about correcting for confounding.

Did you know that if you correct wrongly for a confounder, you actually introduce bias?
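A toy illustration of the simpler direction, where ignoring a confounder flips the sign of an effect (it does not show the over-correction case mentioned above). All counts are invented: severity affects both who gets treated and who recovers, so the crude comparison and the severity-adjusted comparison disagree. The adjustment used here is plain direct standardization to the overall severity mix, chosen for brevity, and the names are made up.

```python
# Invented counts: (recovered, total) per group and severity stratum.
counts = {
    "treated": {"mild": (18, 20), "severe": (48, 80)},
    "control": {"mild": (64, 80), "severe": (10, 20)},
}
STRATA = ("mild", "severe")


def crude_rate(group):
    """Recovery rate ignoring severity altogether."""
    recovered = sum(counts[group][s][0] for s in STRATA)
    total = sum(counts[group][s][1] for s in STRATA)
    return recovered / total


def standardized_rate(group):
    """Recovery rate if the group had the whole population's severity mix."""
    population = {s: sum(counts[g][s][1] for g in counts) for s in STRATA}
    grand_total = sum(population.values())
    return sum(
        (counts[group][s][0] / counts[group][s][1]) * population[s] / grand_total
        for s in STRATA
    )


crude_diff = crude_rate("treated") - crude_rate("control")
adjusted_diff = standardized_rate("treated") - standardized_rate("control")

print(f"crude difference:    {crude_diff:+.2f}")     # -0.08
print(f"adjusted difference: {adjusted_diff:+.2f}")  # +0.10
```

In this made-up table the treated group does better within each severity level, yet looks worse overall because most treated patients were severe cases.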

As a very modest medical researcher myself, I have become very careful about the conclusions I draw from any model I build. While it is very appealing to draw conclusions about causation, they are very often wrong.

For more information, I think this might interest you: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-...

This topic has been discussed at length on HN and has been the topic of many flame wars.

If you're genuinely curious for an answer, in the United States it's absolutely the case that you are legally

> obligated to throw out the conclusion

when it comes to many of the things you mentioned (e.g., you can't consider race, or often gender and sexual orientation depending on the state, when you evaluate a loan/employment application).

The reason for those laws is that, historically, overtly and openly racist people used the loan application process to discriminate in housing.

Asking to be able to use any accurate statistical model for any application is an inherently political request, because bad actors in the past have attempted to exclude people based upon skin color. We can't make that discrimination legal whenever someone can dream up a plausible mathematical model justifying an ultimately racial intent.

The real point of my post is that a discussion on this topic devolves into a flame war unless everyone agrees that we can't turn p-hacking into a legally justifiable way of allowing racial discrimination in important social processes like housing loan or employment applications.

> We can't make that discrimination legal

> an ultimately racial intent.

I reject your assertion that by studying data, data scientists or analysts are somehow complicit in discrimination or any -ism. Their job is to make business decisions based on data, WITHOUT relying on "gut feel" or other human biases. I also reject the expectation that data scientists have an obligation (or the ability) to "fix" whatever biases may be revealed by the data.

> p-hacking

Again, you are making assumptions that data scientists are somehow evil, engaging in shady or illegal tactics to promote discrimination rather than simply doing their job in a straightforward manner. P-hacking is often done by individual researchers looking to get a study published in a scholarly journal (as opposed to months/years of research failing to reach a statistically-valid conclusion), rarely by companies who are looking to make profitable decisions based on what the data is indicating.

It appears that you make the false assumption that the data itself are unbiased and are always factually correct. This is untrue. Data does not appear magically in a dataset. It is the interpretation of the world by humans and may therefore carry the original bias, intended or not, of humans. This is why analysts have to think about what the data means whenever they do their analyses. I would say that is their ethical responsibility.

> think about what the data means

of course they do that, that is literally their job.

But to reverse-engineer the biases that may or may not exist in an original data set -- please explain how this should be accomplished, because I don't see how someone could accurately quantify the amount or degree of race/sex/age/religion/nationality-ism without introducing additional "bias" based on that person's own opinion.

> Data does not appear magically in a dataset.

Right, so why isn't the boss, or exec, or department head, or 3rd-party, who sourced the data responsible for de-biasing the data before even handing it off to the data scientist, so s/he can just do the job of data science-ing, and not political science-ing? You're putting a whole lot of "ethical responsibility" on just one person -- ironically, the one least likely to be good at interpreting human emotional tendencies -- within a much larger ecosystem.

That's the point, it is very difficult to correctly interpret analyses. That's why you don't take conclusions for granted, and work from the basis of what you think is biologically relevant. There is usually a whole phase preceding the actual analysis. You can visualise potential relationships in directed acyclic graphs, to try and see where bias might be introduced. However, that phase is very often skipped by medical researchers.
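
As a rough illustration of the DAG idea, here is a minimal sketch that stores a causal graph as a dictionary and lists variables that cause both the exposure and the outcome. The graph, the variable names, and the "common cause" heuristic are all assumptions made for this example; it is a simplification and not a full backdoor-criterion check.

```python
def ancestors(edges, node):
    """Return every node with a directed path to `node` (edges: parent -> children)."""
    parents = {}
    for parent, children in edges.items():
        for child in children:
            parents.setdefault(child, set()).add(parent)
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for p in parents.get(current, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def common_causes(edges, exposure, outcome):
    """Nodes that cause the exposure and also reach the outcome without passing through it."""
    without_exposure = {
        parent: [c for c in children if c != exposure]
        for parent, children in edges.items()
        if parent != exposure
    }
    return ancestors(edges, exposure) & ancestors(without_exposure, outcome)


# Hypothetical graph: age affects both smoking and disease; income only affects smoking.
graph = {
    "age": ["smoking", "disease"],
    "income": ["smoking"],
    "smoking": ["disease", "tar"],
    "tar": ["disease"],
}
print(common_causes(graph, "smoking", "disease"))
```

Here only `age` is flagged: `income` influences the outcome solely through the exposure, and `tar` sits downstream of it.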

I never said the analyst was the sole person bearing the responsibility. You are right that just as much responsibility must be expected from those who designed the information model and those who collected the data as from the person who ultimately analyses it and prepares it for whatever kind of dissemination. Everybody involved has to take individual responsibility so that we achieve collective responsibility.

I never made any of those assumptions (notice how you have to quote sentence fragments and even single words to set up your straw man...)

What I assume is that not all humans are perfectly rational individuals whose only goal is profit maximization. And that we cannot see people's souls, so we need laws that err on the side of caution.

I also assume that homo economicus is explicitly disincentivized from reasoning about feedback effects and historical context, which are two things many actual homo sapiens care about (for obvious reasons).

E.g., disregarding feedback loops and the long arc of history, i.e. in a vacuum, supporting racialized slavery is perfectly rational for non-enslaved people interested in profit maximization. Consider also a completely non-racist loan lender with a vested interest in high property values who knows dark-skinned people lower property values. Even if the data scientist is perfectly unbiased, bias and hatred in the underlying population can result in data-driven, profit-motivated decisions that harm marginal groups. The fact that this really actually happened en masse is WHY we have these laws...

Regarding the latter point, you are effectively dismissing all non-consequentialist ethics and associated legal traditions as "gut feelings". I submit that these gut feelings play an important role in human societies made up of irrational people. In fact, they are even important in societies with perfectly rational people who are not super reasoners with perfect foresight.

I just finished building a credit risk ML classifier at the company I work for. The model will be used to decide whether or not we lend money to people/companies.

Adding profiling features does give more accurate predictions. However, I pitched not using these features as a competitive advantage to the founders and they (luckily) agreed. We won’t be using them, and we ended up (with more work of course) getting a similar performant model without them.

We can and should try to not use those kinds of features and be as fair as possible.

Please explain. If you removed "profiling features" from your analysis, but then through other means ended up with a "similar performant model," then aren't you still profiling, just with less obvious data inputs?

If Group A is low credit risk and Group B is high credit risk, won't your new model still make it much harder for Group B to be approved? If not, it's not "similar." If you are able to dissect the Groups and pull individual high/low risks from within the groups, that would be a superior model, which is not what you claimed. So how are you not profiling, and how does it make a difference in terms of who gets approved/rejected?
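
The proxy concern raised here can be made concrete with synthetic data. The sketch below is purely illustrative: the groups, the `proxy` feature, and the decision rule are invented for this example and say nothing about the parent commenter's actual model. It shows that a rule which never sees group membership can still produce very different approval rates when one of its inputs correlates with the group.

```python
import random


def simulate_approvals(n=10_000, proxy_strength=0.9, seed=0):
    """Approve on a proxy feature only, then report the approval rate per group."""
    rng = random.Random(seed)
    approved = {"A": 0, "B": 0}
    total = {"A": 0, "B": 0}
    for _ in range(n):
        group = rng.choice(["A", "B"])
        # Hypothetical proxy: equals 1 for most of Group B and 0 for most of Group A.
        p_proxy = proxy_strength if group == "B" else 1.0 - proxy_strength
        proxy = 1 if rng.random() < p_proxy else 0
        # The decision rule never looks at `group`, only at the proxy.
        decision = proxy == 0
        total[group] += 1
        approved[group] += int(decision)
    return {g: approved[g] / total[g] for g in total}


rates = simulate_approvals()
uncorrelated = simulate_approvals(proxy_strength=0.5)
print("correlated proxy:  ", rates)
print("uncorrelated proxy:", uncorrelated)
```

With `proxy_strength=0.9` the approval rates diverge sharply between the groups; with `proxy_strength=0.5` the feature carries no group information and the rates come out roughly equal. Whether a given behavioral feature acts like the first case or the second is an empirical question that has to be checked on real data.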

We focused on behavioral data. We realized that a solid plan for the money, and how it is laid out, does correlate with a low default rate. We think this is a good way to avoid (or at least minimize) those kinds of proxies you mention.

So, people with "a solid plan for the money" (and who is the wise sage that makes THAT decision?) is just your company's replacement for directly using profiling information. You really think you've changed anything?

Please explain how making lending decisions based upon someone’s planning is somehow worse to you. Our preliminary tests show higher scores to people who by traditional means would have had a tougher time getting access to loans.

How do you measure how solid their plan for the money is?

It's illegal to use them. Although it's strictly suboptimal for the firm.

So you’ve replaced first-order proxies with higher-order proxies.

Sounds like it wouldn’t make any difference (other than PR).

There has been a rise in romantic thought pieces lately about how Data Scientists are wizards and can solve any problem with the real superpower of teamwork. (here's an older example from Instacart: https://tech.instacart.com/data-science-at-instacart-dabbd2d...)

In the real world, the state of affairs in Data Science is more practical and pragmatic. And there's nothing wrong with that.

Not worth reading. For example, this makes no sense:

"(2) decision science, which is about “taking data and using it to help a company make a decision”; and (3) machine learning, which is about “how can we take data science models and put them continuously into production."

Machine learning isn't about putting models into production. It's about machine learning models directly from data.

And if decision science is 'taking data and using it to help a company make a decision', then pretty much any job involves data science, e.g. the guy comparing quotes for paperclips and picking a vendor.

> Machine learning isn't about putting models into production. It's about machine learning models directly from data.

From a business perspective, the thing that's different about "machine learning" compared to other things you do with data is that it's possible to take the human out of the loop. That's a qualitative difference, as opposed to the quantitative difference of your business analysts giving better recommendations. We can quibble over terms, but as a broad stroke, things that are machine learning can do that and things that aren't machine learning cannot.

That qualitative difference is the main thrust of the quote you pulled, although it could be more explicit. Rather than the analyst building a model that tells him what shade of red is best for a button so that he can pass that information along to a design team, the button color is connected directly to the model.

The distinction, as you've restated it, still isn't useful:

'From a business perspective, the thing that's different about "machine learning" compared to other things you do with data is that it's possible to take the human out of the loop.'

There are many things you can do with data that take humans out of the loop, that don't involve machine learning. For example, software that automatically re-orders stock in a supermarket once stock (calculated based on starting stock less sales) goes below some level.
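
The re-ordering example is simple enough to sketch without any learning involved. This is a minimal illustration; the class, field names, and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StockItem:
    sku: str
    starting_stock: int
    units_sold: int
    reorder_threshold: int  # set by a human in this sketch
    reorder_quantity: int

    @property
    def current_stock(self) -> int:
        # Stock is inferred as starting stock less sales.
        return self.starting_stock - self.units_sold


def orders_to_place(items):
    """Return {sku: quantity} for every item whose stock is below its threshold."""
    return {
        item.sku: item.reorder_quantity
        for item in items
        if item.current_stock < item.reorder_threshold
    }


items = [
    StockItem("milk-1l", starting_stock=120, units_sold=105,
              reorder_threshold=20, reorder_quantity=100),
    StockItem("rice-1kg", starting_stock=80, units_sold=30,
              reorder_threshold=25, reorder_quantity=60),
]
print(orders_to_place(items))  # {'milk-1l': 100}
```

No human approves the individual orders, yet nothing here would normally be called machine learning.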

You could argue that this still has a human in the loop (to define a threshold) and that you're not removing the human from the loop until the thresholds themselves are automatically calculated.

But then you're just moving the job of the human from deciding the threshold, to deciding what % of the time it's acceptable to be out of stock of that item. Sure, you can automate that, too, but then the job of the human still exists: she's just deciding the objective function that stock-out percentage must satisfy, rather than deciding the stock-out percentage for each SKU directly using a Jupyter notebook or Excel sheet.
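
One way to picture that shift: the human supplies an acceptable stock-out rate and the threshold falls out of it. The sketch below assumes demand during the replenishment lead time is normally distributed, which is a modelling assumption made only for this example; the function and parameter names are hypothetical.

```python
from statistics import NormalDist


def reorder_point(mean_demand, sd_demand, acceptable_stockout_rate):
    """Stock level at which to re-order so that the chance of running out
    before the delivery arrives is roughly acceptable_stockout_rate.

    Assumes lead-time demand ~ Normal(mean_demand, sd_demand).
    """
    z = NormalDist().inv_cdf(1.0 - acceptable_stockout_rate)
    return mean_demand + z * sd_demand


strict = reorder_point(mean_demand=100, sd_demand=20, acceptable_stockout_rate=0.05)
relaxed = reorder_point(mean_demand=100, sd_demand=20, acceptable_stockout_rate=0.20)
print(f"5% stock-outs allowed:  re-order at {strict:.1f} units")
print(f"20% stock-outs allowed: re-order at {relaxed:.1f} units")
```

The threshold is now computed, but the 5% versus 20% choice is still a human judgment about cost and customer experience.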

I sincerely wish more people thought like this. Nothing is different about machine learning. It only performs better than OLS in a very specific subset of rich data, where improving prediction/action is important.

I think of OLS as just one type of machine learning.

OLS is great for many types of problem.

For others, other techniques massively outperform it in some way (e.g. CNNs for classifying camera images or spectrograms of audio data).

Even where OLS performs well, it seems other techniques can frequently do better.
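
For anyone who has not seen it written out, OLS with a single predictor has a closed-form solution. This is a minimal sketch on made-up data; `fit_ols` is an illustrative name, not a library function.

```python
def fit_ols(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # exactly y = 2x + 1
slope, intercept = fit_ols(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

The parameters are still "learned from data", which is why it is reasonable to file OLS under machine learning.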

Totally fair, I guess what I was getting at (poorly) was that OLS has been around for a long time, and there's lots of hype for ML now, but plenty of the techniques here have been readily available for years.

Who are Data Scientists' heroes? Seriously.

In AI, it's Hinton, Le Cun, Bengio.

In systems, it's D Ritchie, J Dean, Berners-Lee, Torvalds.

In distributed systems, it's Lampord, Chandy, J Dean.

In programming languages, it's D Ritchie, Gosling, Dijkstra, Knuth, Milner, etc.

Who are data scientists' heroes or role models?

Hadley Wickham probably.

Also Andrew Gelman

Florence Nightingale, maybe? Her data viz is really impressive, like her 'coxcomb' diagram on mortality in the army:


> She was also a pioneer in the graphical presentation of data. At a time when research reports were only beginning to include tables, Nightingale was using bar and pie charts, which were colour coded to highlight key points (eg, high mortality rates under certain conditions). Nightingale was keen not only to get the science right but also to make it comprehensible to lay people, especially the politicians and senior civil servants who made and administered the laws.


Nate Silver of 538

Hadley Wickham, Andrew Gelman, and maybe Pete Warden are who data scientists look up to, but I think Nate Silver is the first data scientist to be a household name (admittedly, NYT-reading upper middle class households, but still).

That's a correct description of 'data science' as it is understood around me, among social scientists.

Gelman, Tufte and Wickham are in the pantheon, sometimes followed by dataviz fellows like Cairo.

Never heard of Warden.

Silver would likely be described as a 'data journalist' rather than (or at least as often as) a data scientist.

Efron, Hastie, Tibshirani (basically Stanford stats).

Don't forget David Donoho. His paper "50 years of data science" provides a great historical context of data science and a vision of what "Greater Data Science" should be.


Also Tukey, Chambers, Cleveland (all three from Bell Labs), Breiman, Tufte.

I would also submit Judea Pearl to the list.

They are statisticians.

They wrote the books Elements of Statistical Learning and Introduction to Statistical Learning in R. Those books are about least squares regression, clustering, decision trees, random forests, boosting, additive models, support vector machines, etc.

All these are common statistical learning methods used in Data Science.

Also, if you read the fantastic Computer Age Statistical Inference by Efron & Hastie (it's available online!), you notice they are both fans of data science! The whole book reads like a big argument for why we need data science and why traditional statistics is not always the answer.

This becomes especially obvious in the epilogue, where they try to give a quick overview of how the concept of "data science" formed and how statistics diverged into data science + ML and the traditional statistics community.

They end the book by arguing that both communities should come together, because fundamentally they try to do similar things. I also think this is badly needed. Unfortunately, I experience some arrogance on both sides, which makes it harder! (DS/ML people have no idea what they are doing and only throw their algorithms at problems and benchmark them! Statisticians are obsolete and I can just automate them with NNs!)

Wouldn't you consider AI / ML to be a specialization of data science?

> Lampord

Lamport, in case anyone's googling :)

After reading this article, I still have no idea.

I mean, it made it sound like data scientist is just the same as a business analyst? Is this the new computer scientist vs software engineer?

>I mean, it made it sound like data scientist is just the same as a business analyst?

I think this demonstrates how hard "titles" are, because a "business analyst", in the sense that I learned, is not at all like a data scientist (or data analyst):

"Business analysis is a research discipline of identifying business needs and determining solutions to business problems. Solutions often include a software-systems development component, but may also consist of process improvement, organizational change or strategic planning and policy development."

Most BA work I've done involved translating business requirements into technical or software requirements.

In other words, who knows...

Hum, or maybe I meant data analyst, like you say who knows.

“Data scientist” is what they call a statistician on the West Coast. -- an East Coast statistician

This is true. I'm a Ph.D. statistician with a "data scientist" job position in San Francisco.

Data science lost all of its appeal to me when I spent a weekend diving in and found it was about 70% fiddling with weights until you get the answer you want and 30% trying to figure out why the data was so wrong.

A friend of mine told me that in his company people write ETLs, send them to an external service for processing, and get that back - this is what they call "doing data science" :)

As a full stack engineer who knows very little about data science, what courses, libraries, etc. are worth my time to explore? What should I be well versed in to be competent in the future?

A boring response, but have you studied stats? Knowing stats and a bit of SQL is enough to get you pretty far with a lot of problems. I'd consider those skills an important pre-requisite to more advanced tools and techniques.

I haven't, at least not since my last university class. I'd love to find some kind of course that trains me on the basics along with applying it through programming.

Whatever you are using in your stack for data access / search probably has some associated capabilities that support data analysis, and hence data science. I would start there.

> As a full stack engineer

I’m confused, if you know the full stack already don’t you already know all this? It’s all part of the stack after all.

“As a webdev...”

I do know programming. I know how to build a backend and frontend. What I don't know is how to take a large dataset and start making predictions based off of it.

Learn how to export a database to CSV, then learn Excel inside and out. It’s scratching the surface, but it’s a good start.
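
For a web developer, the export step might look like the sketch below. It uses an in-memory SQLite database with a made-up `orders` table so that it runs anywhere; a real project would connect to its own database and write to a file instead of a string buffer.

```python
import csv
import io
import sqlite3

# Hypothetical table and rows, created only so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 12.5), ("bob", 30.0)],
)


def export_query_to_csv(connection, query, out_file):
    """Run `query` and write a header row plus all result rows as CSV."""
    cursor = connection.execute(query)
    writer = csv.writer(out_file)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


buffer = io.StringIO()
export_query_to_csv(conn, "SELECT id, customer, amount FROM orders ORDER BY id", buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting text opens directly in Excel or any other spreadsheet tool.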

Check out Kaggle.com and Coursera's intro to machine learning.

Since we're on the topic of "what data scientists really do", I have comments about these common recommendations:

1. Kaggle.com competitions incentivise chasing a metric, where very large amounts of man-hours are used to improve a model a tiny bit, which may not have dramatically different performance in a real-world situation than a baseline model. (Kaggle Kernels, which consist of Python/R Notebooks, are better for teaching how to analyze datasets: https://www.kaggle.com/kernels)

2. Andrew Ng's Coursera course is good, for an introduction to terminology only. You won't be using anything from the course in real-world use cases (e.g. hand-implementing the matrix algebra behind backpropagation), but it's good to learn what backpropagation is. And please don't say you're an expert in machine learning because you took the course.

While I agree with your second point, I don't agree with your first. Kaggle is like a sport and therefore can't and shouldn't be compared to real-life data science (I mean, who wants to compete in tasks modeled after real life? :D It's often A LOT messier in real life!).

BUT I think Kaggle still trains you in the right things. In my experience, the winners were almost always more creative and didn't win just through massive hyperparameter grid search. I also think that a lot of the usual non-DS struggles are just not as challenging.

If I were in a position to hire a DS and theoretical skills were not that important (maybe there's already somebody with strong theoretical skills on the team), I would just hire somebody based on his Kaggle profile. It will probably take some time until he adjusts to real-world DS, but I don't think it will be that much of a challenge.

1 - Kaggle is a teaching tool to help amateurs work with real data and real problems. Chasing metrics is what data scientists do all day long, it’s a good resource.

As this submission and other top-level comments on this submission note, "chasing metrics" is not what data scientists do all day long.

What do they do?

Mostly connecting to data, cleaning it and finding some place to stash all of it.

Only after that 90% is done can anybody think about modeling data, transforming it, processing it and lastly that glorious 5% of actually analyzing it.

Oh, and then somebody wants the results of the analysis to be put into a fully interactive scalable web application so now we're late.

It is a hybrid of software engineering and stats.

Calling it science is a stretch. I can understand if you are solving problems in a traditional scientific field, but if you are doing economic modeling to manage investment risk and optimize profit for an internet company, it's hardly science. What a scam!

Science isn't a noble and pure endeavor. It's just a methodology to construct predictive power from information.

Science is more of an institution than what you are describing like some isolated act of making a prediction

Maybe data science attains scientific grade when ablation analysis becomes mandatory?

I like the idea of ablation analysis, but when did fiddling with it until it changes become science?

Less black box and more reproducibility/generalisation is what people ask for these days, so ablation studies that expose how the bricks in the model work individually? In old terms, sensitivity studies.
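
A toy version of such an ablation study, for concreteness: remove one component at a time and record how much the error grows. The "model" here is a hand-written linear predictor with made-up weights and data, chosen only so the arithmetic is easy to follow; real ablations usually retrain or re-evaluate a much larger model.

```python
def predict(row, weights, ablated=None):
    """Linear prediction; the ablated feature (if any) is replaced by zero."""
    return sum(
        w * (0.0 if name == ablated else row[name])
        for name, w in weights.items()
    )


def mse(rows, targets, weights, ablated=None):
    """Mean squared error of the (optionally ablated) model."""
    errors = [(predict(r, weights, ablated) - t) ** 2 for r, t in zip(rows, targets)]
    return sum(errors) / len(errors)


# Hypothetical model: x1 matters a lot, x3 a little, x2 not at all.
weights = {"x1": 3.0, "x2": 0.0, "x3": 1.0}
rows = [
    {"x1": 1, "x2": 5, "x3": 2},
    {"x1": 2, "x2": 3, "x3": 1},
    {"x1": 3, "x2": 8, "x3": 4},
    {"x1": 4, "x2": 1, "x3": 3},
]
targets = [predict(r, weights) for r in rows]

baseline = mse(rows, targets, weights)
ablation_cost = {
    name: mse(rows, targets, weights, ablated=name) - baseline
    for name in weights
}
print(ablation_cost)  # {'x1': 67.5, 'x2': 0.0, 'x3': 7.5}
```

The error increase per removed component is the sensitivity information the comment refers to: `x2` can be dropped for free, while `x1` carries most of the predictive power.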

There are some quotes missing from this title ... the proper spelling is Data “Scientists”.

Their work can be perfectly scientific. My problem with it is the redundancy: imagine someone claiming to be a "food chef."


Query Monkey
