
These types of posts validate my concern about the people entering my field right now.

Data science, as a line of work, is distinct from other technical roles in its focus on creating business value using machine learning and statistics. This quality is easily observed in the most successful data scientists I've worked with (whether at unicorn startups, big companies like my current employer, or "mission-driven" companies).

Implicit in this definition is avoiding the destruction of business value by misapplying ML/statistics. In that sense, I am concerned about blog posts like these (which list 50 libraries and zero textbooks or papers) and those who comment arguing the relevance of "real math" in the era of computers.

Speaking bluntly: if you are a "data scientist" that can't derive a posterior distribution or explain the architecture of a neural network in rigorous detail, you're only going to solve easy problems amenable to black-box approaches. This is code for "toss things into pandas and throw sklearn at it". I would look for a separate line of work.
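To make "derive a posterior" concrete, here's the simplest case I'd expect someone to be able to reproduce on paper: a Beta prior with Binomial data. The sketch below states the conjugate result and checks it numerically (the counts are made up):

    import numpy as np

    # Hypothetical data: 14 conversions out of 100 trials, Beta(2, 2) prior on the rate.
    k, n = 14, 100
    a, b = 2, 2

    # Conjugate result (the derivation you should be able to do by hand):
    # posterior is Beta(a + k, b + n - k).
    post_a, post_b = a + k, b + n - k

    # Numerical check: evaluate prior * likelihood on a grid and compare
    # the grid posterior mean against the analytic one.
    theta = np.linspace(1e-6, 1 - 1e-6, 10_000)
    unnorm = theta ** (a + k - 1) * (1 - theta) ** (b + n - k - 1)

    print(np.sum(theta * unnorm) / np.sum(unnorm))   # grid estimate of the mean
    print(post_a / (post_a + post_b))                # analytic mean; they should agree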




I think the "Data Scientist" job title is overloaded–I see several clusters of skills being useful, and in my ideal world they would have similar but slightly different job titles:

–Medium Stats/ML, medium Engineering ("Data Scientist" or "Data Engineer")

–High Engineering on very large datasets, low/medium Stats/ML ("Data Engineer" or "Backend Engineer")

–High Analysis, medium Stats/ML, low Engineering ("Analyst")

–High traditional Stats, High Analysis, low ML/Engineering ("Statistician")

–High ML, medium Stats, medium Analysis ("Data Scientist")

–High ML, medium Engineering ("Machine Learning Engineer")


One of the lessons of the web (in the 1990s everyone was a webmaster, until the field matured) is that after the coders came specialists in design, management, UX, SEO, and content. For data science the most obvious specialty is data visualization, but I suspect there are plenty of new jobs ahead in addition to the core data science jobs.


The second most obvious is cleaner. 90% of the time I spend on any analysis goes to getting data that is usable, or into a usable shape.

And let's not talk about the different CSVs you can encounter... or how to get data out of databases that are literally the only reason some "guardian DBA" still has a job. It can take months to get any access, if you ever get it...
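For what it's worth, the defensive loading I end up writing looks roughly like this (file name, encodings, and delimiters are all made up, and csv.Sniffer won't catch everything):

    import csv
    import pandas as pd

    def load_messy_csv(path):
        """Try a few encodings and sniff the delimiter before giving up."""
        for encoding in ("utf-8", "utf-8-sig", "latin-1", "cp1252"):
            try:
                with open(path, encoding=encoding) as f:
                    sample = f.read(4096)
                dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
                return pd.read_csv(path, encoding=encoding, sep=dialect.delimiter)
            except (UnicodeDecodeError, csv.Error):
                continue
        raise ValueError(f"could not parse {path} with the encodings tried")

    # df = load_messy_csv("export_from_legacy_system.csv")  # hypothetical file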


You know, I really should add a post soon about algorithms, papers, and textbooks. You make an important point which the first responder highlighted, "avoiding the destruction of business value by misapplying ML/statistics."

I understand the math behind what I do, but it's not a fair assumption to think that everyone reading my post will be motivated to pick up and understand the math before they start applying the tools.

With tools like scikit-learn and Orange, it's especially easy to misapply ML and statistics, or simply to approach a problem without understanding the tools and come out with something that looks plausible to the untrained eye.

A key reason to understand your tools, including the math that underlies them, is so you can look at the results of your work and know when something is "off". Beyond that, understanding the underlying math gives you what you need to debug.
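A concrete example of "plausible to the untrained eye": doing feature selection on the full dataset and then cross-validating. The scores look great even when the features are pure noise. A sketch with synthetic data:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5000))   # pure noise features
    y = rng.integers(0, 2, size=100)   # labels unrelated to X

    # Wrong: select "informative" features using ALL the data, then cross-validate.
    # The selection step has already seen the test folds, so the score is inflated
    # well above chance despite there being no signal at all.
    X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
    print(cross_val_score(LogisticRegression(), X_sel, y, cv=5).mean())

    # Right: keep selection inside the cross-validation loop via a pipeline,
    # and the score drops back to roughly 0.5.
    pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
    print(cross_val_score(pipe, X, y, cv=5).mean())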


I propose you can basically Monte Carlo yourself to a decent understanding.

The disadvantage is: you never know for sure that you are right, plus there is extra time spent applying your experience to each new type of problem.

The advantage is: you can more easily relax assumptions once it is set up, and the heuristics you've learned let you handle new problems quicker than the "perfect" way would.
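As a toy example of what I mean: instead of deriving whether the textbook 95% confidence interval still behaves on small, skewed samples, you can just simulate it and look (numbers made up):

    import numpy as np

    rng = np.random.default_rng(42)
    n, trials = 30, 20_000
    true_mean = 1.0   # mean of an Exponential(1) population

    covered = 0
    for _ in range(trials):
        sample = rng.exponential(scale=1.0, size=n)
        se = sample.std(ddof=1) / np.sqrt(n)
        lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
        covered += (lo <= true_mean <= hi)

    # Nominal coverage is 95%; with n=30 and skewed data the simulated coverage
    # comes out noticeably lower, learned by simulation rather than derivation.
    print(covered / trials)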


Or, just like software engineering or any other profession in the world, there's going to be a need for people to solve hard problems and people to solve easy problems. Data science isn't different.


Yeah, that's fair!


> Implicit in this definition is avoiding the destruction of business value by misapplying ML/statistics

This is an incredibly important point.

I'm working as a fundraising and marketing analyst for a non-profit, but my background is in biology. The skill-set needed for analysis is pretty similar between marketing and population ecology. If you ask someone in either field what the biggest barrier to analysis is, getting data would almost certainly be the most common answer for both fields. However, data is treated very differently between the two fields.

On the scientific side, I find that most of the frustration occurs because there isn't enough data to make a conclusion. Peers will criticize conclusions made with insufficient information.

On the business side, I find that I'm often pressured to make claims that are much more confident than the data can support. As a scientist I am always very aware of the limitations of my data, but in business I feel pressured to draw conclusions, and people are waiting to make decisions based on any information they can get out of me.

I spend more time on my write-ups than I do planning my experiments, collecting data, and performing my analysis combined. In a business setting time "moves faster" and the stakeholders in a project expect results no matter what. In these cases, communicating what the limitations are in a concrete way is really important. Expressing risk in terms of money, or probability in terms of coin-flips makes a pretty substantial difference, and can really help people relate to the information you are presenting.
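For what it's worth, that translation step is usually only a few lines. Something like the following sketch (every number here is hypothetical) turns a lift estimate and its uncertainty into a dollar range and a plain-language probability:

    import numpy as np

    rng = np.random.default_rng(7)

    # Hypothetical campaign: estimated lift of 2% with a standard error of 1.5%,
    # applied to a $500k revenue base.
    base_revenue = 500_000
    lift_draws = rng.normal(loc=0.02, scale=0.015, size=100_000)
    dollar_draws = base_revenue * lift_draws

    lo, hi = np.percentile(dollar_draws, [10, 90])
    p_positive = (dollar_draws > 0).mean()

    print(f"80% range: ${lo:,.0f} to ${hi:,.0f}")
    print(f"Chance the change helps at all: about {round(p_positive * 10)} in 10")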


Speaking as a business person: often the biggest challenge is to make ANY decision and actually DO something. The perfect is the enemy of the good. So, to continue the clichés, the business critique of your objections would be "analysis paralysis."

I tell you this just to help you understand what you describe. In my observation of failure modes in business, it is rarely because someone follows the wrong analysis; far more often it's because most are unwilling to make any changes unless confronted with overwhelming evidence. (And that hurdle always gets higher, no matter how much evidence you give.)


>most are unwilling to make any changes unless confronted with overwhelming evidence.

That's probably the second most common problem. I'd say 80% of my job is just fighting confirmation bias. If someone thinks something needs to be changed, they'll take any sign that it should be changed. If someone thinks something should stay the same, they'd argue with God about it.

I probably propose changes more often than I propose keeping things the same way, if only because testing an idea and gathering information requires making a change somewhere. I have a lot of conversations with people who are pressuring me to make a conclusion that the current way is best as soon as possible, so they can throw a lot of money at their pet project.

I'd say that most of the claims I'm being asked to make with limited evidence would be supporting the status quo, which is in line with your assessment.


> and those who comment arguing the relevance of "real math" in the era of computers.

Is this related to my comment? I used "age of computers", but close enough. It's really not a fair representation of what I said at all.

I stressed the importance of knowing theorems and deriving proofs - arguably "realer" math than learning an equation by rote. I did some applied maths in undergrad, and in my experience a lot of my time was devoted to solving large and complex equations using fairly mechanical rules, and comparatively little of my time was spent on axioms and proofs. I wonder whether this focus is justified in the age of computers - might we derive the complex formulas just once or twice as an exercise, and not step through them ourselves again and again? Might we focus more on what the computer can't do well for us - rigour and intuition?
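That is roughly what computer algebra already offers. As a toy example of "derive it once and move on" (sympy), you can let the machine do the calculus showing that the mean minimizes squared deviations, and spend your own effort on why the loss is set up that way:

    import sympy as sp

    # Which constant c minimizes the sum of squared deviations from three
    # symbolic data points? Differentiate, set to zero, solve.
    c, x1, x2, x3 = sp.symbols("c x1 x2 x3")
    loss = (x1 - c) ** 2 + (x2 - c) ** 2 + (x3 - c) ** 2

    print(sp.solve(sp.diff(loss, c), c))   # [x1/3 + x2/3 + x3/3], i.e. the mean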


> Is this related to my comment?

It was initially related, yeah, but I realized I had uncharitably read your point. I edited my comment, but not enough. Sorry about that.

To be fair, this point is often raised in these threads as "why do math when computers do it for us?" so the criticism wasn't specifically levied against you.

We agree that repeated derivation when working on a new problem can be useless. It would be silly to work out the OLS assumptions from first principles every time you import sklearn.linear_model! I believe understanding those assumptions, though, or (say) how backpropagation works, is important, since (1) it can help you debug issues and (2) it explains the modifications made to the core models (GLMs or LSTMs, in the above examples).
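In practice that understanding often cashes out as a two-minute diagnostic. A sketch with synthetic data: fit LinearRegression, then look at where the residuals sit; a systematic pattern is the signature of a violated assumption, well before any scores are compared.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)

    # Synthetic data where the true relationship is quadratic, not linear.
    X = rng.uniform(0, 10, size=(200, 1))
    y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=1.0, size=200)

    model = LinearRegression().fit(X, y)
    residuals = y - model.predict(X)

    # Under the OLS assumptions residuals should hover around zero everywhere.
    # Here they are systematically negative in the middle of the range and
    # positive at the ends: the signature of a misspecified linear fit.
    middle = (X[:, 0] > 3.3) & (X[:, 0] < 6.7)
    print(residuals[middle].mean(), residuals[~middle].mean())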


Part of the issue, as I see it (for me, unrelated to the article), is that companies are willing to use the data scientist title for positions that need none of the rigor you mention. Those people were nonetheless hired and are now called data scientists.

The same thing seems to happen in other fields, too: software engineers who don't engineer, data scientists who don't 'science', project managers who don't manage. Are they top in their field? No, but they somehow have a job with the title and have so far managed to stay employable. Do they care whether they live up to the rigor top practitioners expect of the title? Probably not; they still get paid, keep the title, and can probably get hired at the next similar place.

It's kind of sad that these positions may 'cheapen' the title, so what can be done about that? Not much, I guess, since it seems companies can use position titles however they like...


In my (admittedly short) experience as a data scientist, "solving the wrong problem"/"working on irrelevant things" and "inadequately cleaned/prepped training data" are vastly, overwhelmingly more common failure modes than "building the right thing with good data inputs but misunderstanding the algos." Probably more common by an order of magnitude or two.

Then again, maybe I'm just working at companies with problems that are amenable to easily-understood algos but have plenty of data-and-product-themed problems.


Great points about issues faced before dealing with models/algos directly. Understanding something about the models/algos can help guide data prep too.



