How to Become a Data Scientist, Part 2 (experfy.com)
184 points by iamjeff on Aug 7, 2016 | 50 comments

I don't like the term 'Data Scientist' at all. I think it's far too loaded. It's a bad thing. Many developers already building sophisticated analytics and predictive systems will avoid identifying with what they are really doing for fear of being challenged as to whether they are a 'real' scientist or not.

Yes. All scientific disciplines involve data, and while there is a field of studying "data" as a concept itself, this is most certainly not what these people at these companies do.

The accurate title is Data Analyst, something that has already existed for a long time and works just fine.

The problem is that the term data analyst has come to mean data reporter. Similarly "business analyst" generally involves tasks that are best solved in Excel. The "science" in data science is about testing predictions. But I agree data science is a terrible phrase.

Or business analyst, or quant in the world of banking/finance. Someone who calls themselves a "data scientist" lacks the awareness that people have been doing this stuff for decades, and therefore also lacks the body of knowledge that has built up over that time. Actually, anyone can download a couple of open source libraries and run the numbers through them, so where is the value-add of "data science"?

I never liked "software engineer," since most of what I saw was not engineering-like. I always called myself a software developer, regardless of any company's title system.

So, data developer?

I don't see a huge difference between "engineer" and "developer". At least in Europe there are degrees in Software Engineering (which were born from traditional Engineering curricula) that are taught by the University's "Faculty of Engineering", so the name is somewhat justified.

I would actually find it weirder if someone described him/herself as a "Computer Scientist", even with a CS degree, outside of Academia :)

While you create new software, you usually don't create data. But +1 on the "software engineer", prefer developer as well.

> you usually don't create data

I think the creation of new data is the whole point, actually. Even at the lowest level, aggregation of data is still new data. Deriving insight by the use of more sophisticated methods is also creating new data. Arguably, all data is derived so just because a data engineer distills large data sets into smaller ones doesn't mean they don't create new data.

> > you usually don't create data

> I think the creation of new data is the whole point, actually.

I had the same hesitation when I heard "data developer" in my head. But I agree with JoBrad. If you have a pile of dirt, and compress some of it, then you have a rock where there was none. If you take a rock and "disassemble" it, you have nuggets of gold where there was none.

Really, titles don't matter too much (which is why I used my own :), and we know generally what's being discussed when we hear data science and data scientist.

Yes, scientists do research, and even if you are doing experiments and analysis, if your results are not generalizable, then it's difficult to argue that's a 'research' pursuit. Applying established methods to a particular case is perhaps closer to engineering, or a more general term of 'analyst'.

Considering that all scientists work with data, often very large and complex datasets, the term comes off like a joke.

"A data scientist is a statistician who lives in San Francisco." -- Josh Wills.

I think a lot of companies are now using that term for what is basically a data analyst or perhaps statistical analyst role. Because it sounds impressive, it's a way to attract people: we have egos and would like a title that is externally fascinating. 'Look dad, I am a scientist!' That's not to say there aren't some doing impressive things or that there is no valid science done with data. What I have often seen, though, is more like BI roles marketed with a new title.

What about "data curator"? It's a term I came up with.

I prefer being called a Software Developer over Software Engineer or Programmer. For a long time, my work has mostly been developing solutions to capture, parse, enrich, and visualize huge volumes of data. Data Developer sounds weird.

What do you guys feel?

In part 1/3, the author writes that there are 2 branches of data science:

> Data science for people (Type A), i.e. analytics to support evidence-based decision making

> Data science for software (Type B), for example: recommender systems as we see in Netflix and Spotify

Isn't "type A" business intelligence, and isn't "type B" machine learning? Why doesn't he use those more widely known terms? Or maybe he is referring to something else?

Both business intelligence and machine learning are narrower terms with a more specific meaning. Business intelligence has the connotation of a certain set of techniques (e.g. reporting, SQL, querying, PowerPoint presentations) while machine learning is a fairly specific set of tools that is much narrower than what most software data scientists do. E.g. I'm a software data scientist and I spend much more time on descriptive statistics than machine learning.

Yes, I agree that both ML and BI are narrower definitions compared to that of data science.

My current job title is "Data Engineer". I'm not sure how that differs from a "Data Scientist".

My job basically aligns with Type A above. I mostly work on optimizing our industrial process through a combination of modelling and simulation work. Other than that, I do quality/defect investigations when we have issues with defective batches etc., and I do yield optimization work. I also oversee various plant trials as required.

I use Business Intelligence software (like SAS and COGNOS) but I use other tools as well (including my own C/C++ code). I lean heavily on my own theoretical knowledge - in particular metallurgy and minerals processing. (I am a Materials Engineer by qualification). I think most BI people would lack the background theory to do my role.

My job title was more or less arbitrarily chosen by my manager (other people in my team have 'Automation Engineer', 'Process Engineer' etc. as their role titles). I consider myself an Engineer above anything else.

That seems like a pretty inaccurate job title then. Data Engineers are people working with data pipelines, storage, and schemas. They can lean towards more software engineering or towards analytics with dashboarding/machine learning but their primary responsibilities are the former.

Yeah, it is a pretty generic title. Regardless, it fits my job description pretty well: I am an Engineer and I work with data.

The people who do the work you mention (schemas etc.) are called Database Admins here. They work for Information Services, a different department that I'm not all that familiar with.

The distinction is being made on end outputs -- is it more like a Powerpoint deck, or more like a production data pipeline -- not on techniques.

"Machine learning" includes plenty of activities that can be used to provide evidence for one-off human decision making (e.g., using a model to produce forecasts or to understand sensitivities).

Mmmmh yes, often ML produces data which supports human decision making; I see what you mean. In fact, now that you mention it, in my company the BI team is starting to use ML techniques to recognise some particular patterns.

I liked this series and this part. I think it's important for people using data science in the industry to continue giving insight into best practices, feedback to academic programs, and occasional insights into the problem applications. In my mind, this ultimately improves the quality, education, and marketability of data science.

I discovered the series earlier today on HN and the discovery could not have been more timely. I am just about to embark on the first six to eight months of a learning journey and see immediate utility in insightful series such as this one. I also came across a really helpful post that gives recommendations on progress markers for the self-taught developer [A Better Way to Learn Programming? Notes on The Odin Project: http://everydayutilitarian.com/essays/notes-on-the-odin-proj...]. Guides like these, while they take a lot of time to write and refine, are complete lifesavers for entry-level professionals and prospective practitioners (especially if they come from professionals that have been "tried and tested").

Guides like these inspire me to quit being lazy and get back to writing one!

You are a data wrangler? Perhaps a guide would do; noobs like me have no perspective on what to learn and how to learn it. I mean, it wasn't even a year ago that I was convinced that I could go from 0 to 100 data science-wise in under a year: I wanted to learn it all. It took me the better part of a year to realize that I had wasted innumerable hours devising a curriculum and timelines that were plain dumb. A practical guide could have spared me a lot of hurt and while I cannot at this moment compensate you (or the community for that matter), I am sure that opportunities will certainly arise for me to pay my debt. Would love to see a guide from you- it would come with the added advantage that you would be accessible to the brilliant HN community. I would give away a limb to see such a discussion go down: what to learn and where to learn it from (a lot of folks, I imagine, would not mind recommendations for openly accessible material; I know I wouldn't mind that)? how fast should you expect to go/move/learn? time commitments? tools and frameworks? motivation hacks? where would I go to find remote jobs? what level of proficiency should I achieve in the first sprint?

The best way to learn how to wrangle data is practice, especially outside of academic settings, where the example data is not necessarily reflective of real-world data.

Helping others wrangle data is one of the reasons I publish my Jupyter notebooks as open source. A few examples of my data wrangling with R:

Processing Stack Overflow Developer data: https://github.com/minimaxir/stack-overflow-survey/blob/mast...

Identifying related Reddit Subreddits: https://github.com/minimaxir/subreddit-related/blob/master/f...

Determining the correlation between the gender of movies' lead actors and box office revenue: https://github.com/minimaxir/movie-gender/blob/master/movie_...


I'd add that Kaggle is very good for the "other end" of data science: they generally have pretty clean data, and clear problem descriptions.

In real life the data is never clean and the problems are rarely known in advance.

We are also growing https://www.kaggle.com/datasets, which won't necessarily have clean data, clear problem statements, or a well-defined task.

The "Analysis of Lead Gender and Box Office" project put a smile on my face because it looks like so much fun. That is something that I would like to know how to do in the coming months. Thank you for open sourcing the notebooks and for the recommendation to practice and then demo on real data.

COI: author is a "data science" recruiter and the field has not coalesced down to a static definition. Caveat lector

In the article Alec mentions it is important to be able to read academic papers properly. Does anyone have good resources for this? I've read some papers before but do not have a research/academic background where I really had to dig deeply into them.

I'm not sure there's a good answer to this.

The best I can say is to go to grad school. That's a terrible answer, but it's perhaps the only realistic one. It's in that situation, or one very similar, where you're exposed to loads of criticism and discussion. Basically any paper that was competently written (even if it wasn't competent work) is going to sound convincing to the naive. After hearing a few papers get torn down, you'll see the cracks in weak arguments, the poorly supported conclusions, and the seemingly boring stuff that's absolutely brilliant.

Very generally, the best sign of good work is a 'masochistic' author. What I mean by this is an author that writes as though every result they get is deeply suspect and needs to be corroborated in multiple ways. When it's almost exhausting to read because it feels like they're just beating themselves up, you're probably reading something really special.

Likely the most difficult thing to do as an 'outsider' is to get a sense of how 'trustworthy' certain results are. Some methods are almost binary in that you either get no result, or a great result. If an author shows this, there might be very little reason to doubt it, and thus independent lines of evidence are not really necessary, especially if there's context that supports or is consistent with that result. Other methods are notoriously terrible and need a great deal of careful controls and analysis to even be considered, and then only as one angle of attack. Sometimes you can find reviews that discuss methods like this, which would be an invaluable resource. Reviews are generally a great way to start reading a field, anyhow.


As a practical guide, a well written paper can just be read start to finish. Then reflect on it to see if you understand it. Could you explain the paper to someone else? That's a good sign of whether you understand it. After that, think of critiques. Could the results be interpreted in different ways? Was the analysis appropriate for the data? Are the methods reliable? All papers have weaknesses; we live in a world of finite time and resources. All papers could be better, so think about what could be done. After that, consider what would be reasonable to do. Did the authors skip something conspicuous? That's a good sign that there was some difficulty they were avoiding. That might be fine, but it also might mean there's data that doesn't fit with their conclusions, which would be a very big issue indeed.

That latter part is the most important, but also the most difficult to do. It requires reading dozens, really hundreds, of papers so that you learn about some 'unknown unknowns'. Hearing talks really helps with this, too, as many people will give a sort of history of their work that includes some of the twists and dead ends.


That all said, anyone can read a paper. It's not 'magic' that lets you do it. You'll miss some of the nuance, and occasionally be led astray, but peer review works well enough that papers are mostly quite good, with the devil held to the details. Like most things, it likely follows the Pareto Principle, with a little effort bearing outsize results.


We detached this subthread from https://news.ycombinator.com/item?id=12243816 and marked it off-topic.

It is valid to downvote on the manner in which something is said if it's not civil.

In this case, not only is ALL CAPS utilized, it hits the No True Scotsman fallacy.

Unless someone uses pejorative adjectives, I don't see how we can consider a comment uncivil. Had it been said in real life, then yes, the tone of voice would have given it away, but I would rather focus on the point raised than dig into the text.

Invoking the No True Scotsman fallacy is not a good defense for the No True Scotsman fallacy.

BS in Statistics? If you don't have one you are very likely not a data SCIENTIST. You will be a data guy.

I know that some meteorologists have used normality in forecasting. This is an example of why you cannot become a data SCIENTIST. Another example: applying regression to your data. If you think that regression is as simple as its formula then you need at least 4 years to understand what I mean.

Why should not having a BS in statistics prevent anyone from properly learning and applying statistics in a rigorous scientific fashion?

There are a lot of people with a rigorous mathematical background (mathematicians, physicists, biologists, computer scientists, ...) who are perfectly capable of understanding and applying stats concepts at a high level. In addition, these people have a lot of experience with doing scientific research, so shouldn't they be even more qualified to call themselves "data scientists"?

Can you give an example of something that clearly distinguishes a "data scientist" from say a physicist who learned regression from a stats textbook?

There are a lot of gotchas that don't seem like errors but completely invalidate an analysis when it is done without a thorough understanding of the technique.

For example, you can learn regression from a stats textbook, but unless you've gone through a thorough (and painful) graduate-level stats course, you probably haven't seen the edge cases that invalidate assumptions and necessitate a more complex regression. E.g. your regression may suggest there is no effect, but when you look at the residuals, you may find systematic bias that you can model using a subject-specific random effect or some transformation as a generalized linear model...

That isn't to say you need a graduate level stats degree but applying statistics without understanding the pitfalls can lead to seriously wrong conclusions.
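
To make the residuals point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data is made up, the helper names (`fit_line`, `group_means`, `within_subject_slope`) are hypothetical, and subtracting each subject's mean is a simple fixed-effects-style stand-in, not a real random-effects or generalized linear model.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def group_means(subjects, values):
    """Mean of the values belonging to each subject label."""
    buckets = defaultdict(list)
    for subject, value in zip(subjects, values):
        buckets[subject].append(value)
    return {s: sum(v) / len(v) for s, v in buckets.items()}


def within_subject_slope(subjects, xs, ys):
    """Slope after subtracting each subject's own mean from x and y."""
    mean_x = group_means(subjects, xs)
    mean_y = group_means(subjects, ys)
    centered_x = [x - mean_x[s] for s, x in zip(subjects, xs)]
    centered_y = [y - mean_y[s] for s, y in zip(subjects, ys)]
    slope, _ = fit_line(centered_x, centered_y)
    return slope


# Made-up data: three subjects, each with its own baseline level.
subjects = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
xs = [0, 1, 2, 2, 3, 4, 4, 5, 6]
ys = [3.0, 4.0, 5.0, 2.5, 3.5, 4.5, 2.0, 3.0, 4.0]

# Pooled fit that ignores which subject each observation came from.
pooled_slope, pooled_intercept = fit_line(xs, ys)
residuals = [y - (pooled_intercept + pooled_slope * x) for x, y in zip(xs, ys)]

# Residuals averaged per subject: non-zero means signal systematic bias.
residual_bias = group_means(subjects, residuals)

# Fit again after removing each subject's baseline.
adjusted_slope = within_subject_slope(subjects, xs, ys)
```

On this toy data the pooled slope is zero ("no effect"), the mean residual differs by subject, and the within-subject slope is 1. A proper mixed model would estimate the subject effects rather than just subtracting means, but the residual check is the part that catches the problem.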

That's a good point, but I suspect that a lot of serious gotchas that a data scientist might encounter in the wild are not taught as part of a graduate-level statistics course. Being able to think critically and quickly adapting to the problem at hand might end up being more important than previous experience in stats (which is still very valuable, of course).

Does the word scientist mean something different if it is in all caps?

I mean, I get it. You would like it for the word to remain some pure version of meaning that it actually never had. Similar to getting upset that people using the word literally in a very figurative sense.

The relevant question on this style of article is not about word smithing.

That said, I'm in a field where we often through the title engineer on people, but we don't know why.

I don't really want to weigh in on a semantic argument, but I was considering what would be a good definition for a 'scientist' vocation.

To me, a scientist is someone who engages in research with generalizable results.

That would exclude someone who does experiments and analysis, but only applies established methods to a particular problem. Call them an analyst, perhaps, but science is an ongoing dialog that they are not participating in.

I think that's part of why many consider it really presumptuous for 'data scientists' to call themselves such. Some of them are certainly developing new methods and engaging in a kind of dialog that is definitely science. Others are addressing business needs with a new sort of analysis, drawing on the field but not giving back to it. There is absolutely nothing wrong with that, but it does seem like that pursuit is different enough to bother calling it something else.

Historically, I think this would exclude a lot of well known scientists. Consider, many of them were running experiments to find an answer. Could have been aiming cannon balls or planting crops. Or, in the case of many of them, looking for a way to transmute matter. :)

More recently, a lot of scientists worked to figure out atomic energy. Most, in all likelihood, we do not know the names of anymore.

I think of it as musicians. If you define musician to be rockstar, there are not nearly as many as if you include school teachers, symphony players, conductors, etc. Yet, for most people, this latter set would definitely be considered musicians.

Get your point, but "literally" actually did use to have a different meaning. Only recently have dictionaries accepted the nonliteral use as a form of emphasis. In fact Cambridge still disallows it for formal speech.

Define recently. :) http://www.merriam-webster.com/words-at-play/misuse-of-liter...

Now... I deserve to be literally roasted for some of the other mistakes above. I should know throw versus through... Not even pronounced the same...

Saying you can't be a data scientist without a Bachelor's degree in statistics is like saying you can't be a software engineer without a Master's degree in software engineering. Your educational system is not really a good environment for learning, so I'm pretty sure someone can become a data scientist with a French "License d'Histoire Moderne" and a lot of work.

Oh yeah, there was a miscalculation of some financial information, and I guess stocks fell because of it. Why? Because some smart guys used Excel.

Or, then I can become a doctor, right? By studying hard at home?

You should not employ a self-taught software engineer or statistician or doctor in critical positions such as health, finance, engineering, and science. Some might be okay but 99.9% of them will be incompetent for the work.

I appreciate that you hedged by saying "are very likely not" instead of just saying "not".

See, here's the deal. Numerical literacy (numeracy) comes about through many processes, one of which is obtaining a statistics degree.

However, not many fresh statisticians could tell you or me the expected runtime of taking a numerical field's median on a large database. Why? Because the backgrounds are different. Yet, for a solid data science team, you'd like at least someone who can give insight as to whether your project will take minutes or decades to complete. This (contrived but important) example suggests that while a BS in statistics is an important and necessary component of a data science team, it is not sufficient.

It was a statistics class where I learned a linear time median algorithm. But that was pretty lucky, I admit.
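
For readers wondering what a linear-time median looks like, below is a rough sketch of randomized quickselect in Python. This is an assumption about the kind of algorithm meant: quickselect runs in expected linear time, whereas the median-of-medians variant guarantees linear time in the worst case, and the comment above does not say which one the class covered. The function names are hypothetical.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-indexed) in expected O(n) time."""
    items = list(values)
    if not 0 <= k < len(items):
        raise IndexError("k is out of range")
    while True:
        pivot = random.choice(items)
        lows = [x for x in items if x < pivot]
        highs = [x for x in items if x > pivot]
        n_equal = len(items) - len(lows) - len(highs)
        if k < len(lows):
            # The answer lies among the elements smaller than the pivot.
            items = lows
        elif k < len(lows) + n_equal:
            # The answer is the pivot itself (or a duplicate of it).
            return pivot
        else:
            # Discard the smaller and equal elements and shift the rank.
            k -= len(lows) + n_equal
            items = highs


def median(values):
    """Median via quickselect; averages the two middle values for even n."""
    items = list(values)
    n = len(items)
    if n == 0:
        raise ValueError("median of an empty sequence")
    mid = n // 2
    if n % 2 == 1:
        return quickselect(items, mid)
    return (quickselect(items, mid - 1) + quickselect(items, mid)) / 2
```

Sorting first would cost O(n log n). Quickselect only recurses into the side that contains the target rank, which is where the expected linear bound comes from. Note that this sketch holds everything in memory, so it says little about the median of a field in a large database.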

95% of undergraduate statistics education is focused on formal inference. Data science, in my experience, involves a lot more exploratory data analysis [1] than formal inference (frequentist or Bayesian).

The extreme focus on inference and the hypothesis testing step in the scientific method is something people with a formal statistics education have to overcome to be productive data scientists. Or applied statisticians, really! It is more important to understand the data, organize it creatively, and find unexpected structure.

[1] https://en.wikipedia.org/wiki/Exploratory_data_analysis

Being a scientist is about having a mindset and following a certain approach.

Existing institutions do not have a monopoly on who is a scientist.
