Being a Data Scientist: My Experience and Toolset (jeffersonheard.github.io)
168 points by jeffheard on Jan 17, 2017 | 48 comments



These types of posts validate my concern about the people entering my field right now.

Data science, as a line of work, is distinct from other technical roles in its focus on creating business value using machine learning and statistics. This quality is easily observed in the most successful data scientists I've worked with (whether at unicorn startups, big companies like my current employer, or "mission-driven" companies).

Implicit in this definition is avoiding the destruction of business value by misapplying ML/statistics. In that sense, I am concerned about blog posts like these (which list 50 libraries and zero textbooks or papers) and those who comment arguing the relevance of "real math" in the era of computers.

Speaking bluntly: if you are a "data scientist" who can't derive a posterior distribution or explain the architecture of a neural network in rigorous detail, you're only going to solve easy problems amenable to black-box approaches. This is code for "toss things into pandas and throw sklearn at it". I would look for a separate line of work.
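
To make "derive a posterior distribution" concrete, here is a minimal sketch of about the simplest case there is: a Beta prior on a success rate, updated with binomial data. The conjugate Beta-Binomial model, the function names, and the counts are my own illustration, not something the comment above specifies.

```python
from fractions import Fraction


def beta_binomial_posterior(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior plus binomial data
    gives a Beta(alpha + successes, beta + failures) posterior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution, kept exact as a fraction."""
    return Fraction(alpha, alpha + beta)


# Uniform prior Beta(1, 1), then 7 successes and 3 failures (made-up counts).
posterior = beta_binomial_posterior(1, 1, successes=7, failures=3)
print(posterior, beta_mean(*posterior))
```

The point of working it by hand once is seeing that the prior acts like pseudo-counts, which is hard to spot if the update only ever happens inside a library call.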


I think the "Data Scientist" job title is overloaded–I see several clusters of skills being useful, and in my ideal world they would have similar but slightly different job titles:

–Medium Stats/ML, medium Engineering ("Data Scientist" or "Data Engineer")

–High Engineering on very large datasets, low/medium Stats/ML ("Data Engineer" or "Backend Engineer")

–High Analysis, medium Stats/ML, low Engineering ("Analyst")

–High traditional Stats, High Analysis, low ML/Engineering ("Statistician")

–High ML, medium Stats, medium Analysis ("Data Scientist")

–High ML, medium Engineering ("Machine Learning Engineer")


One of the lessons of the web (in the 1990s everyone was a webmaster until the field matured) is that after the coders, specialists emerged in fields like design, management, UX, SEO and content. For data science the most obvious is data visualization, but I guess there are plenty of new jobs ahead in addition to core data science jobs.


The second most obvious is data cleaning. 90% of the time I spend on any analysis goes into getting data that is usable, or into a usable shape.

And let's not talk about the different CSV variants you can encounter... or how to get data out of databases that are literally the only reason some "guardian DBA" still has a job. It can take months to get any access, if you ever get it...
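
For the "different CSV" problem, a small sketch using only the Python standard library: let `csv.Sniffer` guess the delimiter from the text instead of hard-coding it. The helper name `read_rows`, the candidate delimiter list, and the sample data are assumptions for illustration; sniffing is a heuristic and can guess wrong on short or messy files.

```python
import csv
import io


def read_rows(text, candidates=",;\t|"):
    """Guess the delimiter from the text itself, then parse with it."""
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    return list(csv.reader(io.StringIO(text), dialect))


# Made-up sample using semicolons, as some spreadsheet exports do.
sample = "id;name;city\n1;Ana;Lisbon\n2;Bo;Oslo\n3;Cy;Rome\n"
rows = read_rows(sample)
print(rows[0])
```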


You know, I really should add a post soon about algorithms, papers, and textbooks. You make an important point which the first responder highlighted, "avoiding the destruction of business value by misapplying ML/statistics."

I understand the math behind what I do, but it's not a fair assumption to think that everyone reading my post will be motivated to pick up and understand the math before they start applying the tools.

With tools like scikit-learn and Orange, it's especially easy to misapply ML and statistics, or simply to approach a problem without understanding the tools and come out with something that looks plausible to the untrained eye.

Key to the reason that you should understand your tools, including the math that underlies them, is that you should be able to look at the results of your work and know if there's something "off". And beyond that the underlying understanding of the math involved gives you the tools you need to debug.


I propose you can basically monte carlo yourself to a decent understanding.

The disadvantage is: You never know you are right for sure, plus there is extra time spent on applying your experience to each new type of problem.

The advantage is: once the simulation is set up, you can relax assumptions more easily, and the heuristics you learn let you deal with new problems quicker than the perfect way.
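
A minimal sketch of what "Monte Carlo yourself to an understanding" can look like, assuming a toy question I picked for illustration: the chance of at least two heads in three fair coin flips. The function name, seed, and trial count are arbitrary choices.

```python
import random


def simulate_at_least_two_heads(trials, seed=0):
    """Estimate P(at least 2 heads in 3 fair flips) by brute-force simulation."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(3))
        if heads >= 2:
            hits += 1
    return hits / trials


estimate = simulate_at_least_two_heads(20_000)
exact = 0.5  # P(2 heads) + P(3 heads) = 3/8 + 1/8
print(f"simulated={estimate:.3f} exact={exact}")
```

Here the exact answer is known, so the simulation can be checked. On a real problem it usually is not, which is the "never know you are right for sure" disadvantage.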


Or, just like software engineering or any other profession in the world, there's going to be a need for people to solve hard problems and people to solve easy problems. Data science isn't different.


Yeah, that's fair!


> Implicit in this definition is avoiding the destruction of business value by misapplying ML/statistics

This is an incredibly important point.

I'm working as a fundraising and marketing analyst for a non-profit, but my background is in biology. The skill-set needed for analysis is pretty similar between marketing and population ecology. If you ask someone in either field what the biggest barrier to analysis is, getting data would almost certainly be the most common answer for both fields. However, data is treated very differently between the two fields.

On the scientific side, I find that most of the frustration occurs because there isn't enough data to make a conclusion. Peers will criticize conclusions made with insufficient information.

On the business side, I find that I'm often pressured to make claims that are much more confident than the data can support. As a scientist, I am always very aware of the limitations of my data, but in business I feel like I'm pressured to make conclusions, and that people are waiting to make decisions based on any information they can get out of me.

I spend more time on my write-ups than I do planning my experiments, collecting data, and performing my analysis combined. In a business setting time "moves faster" and the stakeholders in a project expect results no matter what. In these cases, communicating what the limitations are in a concrete way is really important. Expressing risk in terms of money, or probability in terms of coin-flips makes a pretty substantial difference, and can really help people relate to the information you are presenting.
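
One way to do the translation described above, as a rough sketch: a probability p is about as likely as getting -log2(p) heads in a row with a fair coin, and risk in money terms is probability times cost. The function names and the numbers are hypothetical, not taken from any real campaign.

```python
import math


def coin_flip_equivalent(probability):
    """Number of consecutive fair-coin heads that is as likely as the event."""
    return -math.log2(probability)


def expected_loss(probability, cost):
    """Risk expressed as money: chance of the bad outcome times what it costs."""
    return probability * cost


p_failure = 0.125  # hypothetical chance that the campaign loses money
print(f"About as likely as {coin_flip_equivalent(p_failure):.0f} heads in a row")
print(f"Expected loss: {expected_loss(p_failure, 40_000):.0f}")
```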


Speaking as a business person: often the biggest challenge is to make ANY decision and actually DO something. The perfect is the enemy of the good. So to continue the cliches the business critique of your objections would be "analysis paralysis."

I tell you this just to help you understand what you describe. But in my observations of failure modes in business, it is rarely because one follows the wrong analysis, but more because most are unwilling to make any changes unless confronted with overwhelming evidence. (And that hurdle always gets higher no matter how much evidence you give.)


>most are unwilling to make any changes unless confronted with overwhelming evidence.

That's probably the second most common problem. I'd say 80% of my job is just fighting confirmation bias. So if someone thinks something needs to be changed, they'll take any sign that it should be changed. If someone thinks something should stay the same way, they'd argue with god about it.

I probably propose changes more often than I propose keeping things the same way, if only because testing an idea and gathering information requires making a change somewhere. I have a lot of conversations with people who are pressuring me to make a conclusion that the current way is best as soon as possible, so they can throw a lot of money at their pet project.

I'd say that most of the claims I'm being asked to make with limited evidence would be supporting the status quo, which is in line with your assessment.


> and those who comment arguing the relevance of "real math" in the era of computers.

Is this related to my comment? I used "age of computers", but close enough. It's really not a fair representation of what I said at all.

I stressed the importance of knowing theorems and deriving proofs - arguably "realer" math than learning an equation by rote. I did some applied maths in undergrad, and in my experience a lot of my time was devoted to solving large and complex equations using fairly mechanical rules, and comparatively little of my time was spent on axioms and proofs. I wonder whether this focus is justified in the age of computers - might we derive the complex formulas just once or twice as an exercise, and not step through them ourselves again and again? Might we focus more on what the computer can't do well for us - rigour and intuition?


> Is this related to my comment?

It was initially related, yeah, but I realized I had uncharitably read your point. I edited my comment, but not enough. Sorry about that.

To be fair, this point is often raised in these threads as "why do math when computers do it for us?" so the criticism wasn't specifically leveled at you.

We agree that repeated derivation when working on a new problem can be useless. It would be silly to work out OLS assumptions from first principles upon any import of sklearn.linear_model! I believe understanding those assumptions, though, or (say) how backpropagation works is important, since it can help you (1) debug issues and (2) explain modifications to the core models (GLMs or LSTMs, in the above examples).
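
As a small illustration of the debugging point, here is a sketch of simple OLS written out by hand (closed form, one predictor) and applied to data that deliberately violates the linearity assumption. This is not how sklearn.linear_model is implemented; the function names and data are assumptions made for the example.

```python
def fit_ols(xs, ys):
    """Closed-form simple linear regression: slope = Sxy / Sxx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Observed minus fitted values."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


xs = [0, 1, 2, 3, 4]
ys = [x * x for x in xs]  # quadratic on purpose, so a line is the wrong model
slope, intercept = fit_ols(xs, ys)
res = residuals(xs, ys, slope, intercept)
print(slope, intercept, res)
```

The fit runs without complaint, and the residuals sum to zero as they always do with an intercept. Only the U-shaped pattern in the residuals shows that the linearity assumption is broken, which is the kind of "something is off" signal that knowing the assumptions lets you look for.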


Part of the issue as I see it (for me, unrelated to the article), is that companies are willing to use the data scientist term for positions that need none of the rigor you mention. However, the people were hired and are now called data scientists.

The same type of thing seems to happen in other fields, too. Software engineers who don't engineer, data scientists who don't 'science', project managers who don't manage. Are they top in their field? No, they somehow have a job with the title though and so far have managed to not become unemployable. Do they care if they are rigorous in what their title is expected to be by top practitioners? Probably not, they get paid still and have the title, and can probably get hired at the next similar place.

Kind of sad that these positions may 'cheapen' the title, so what can be done about that? Not much I guess, since companies can use position titles as they'd like it seems...


In my (admittedly short) experience as a data scientist, "solving the wrong problem"/"working on irrelevant things" and "inadequately cleaned/prepped training data" are vastly, overwhelmingly more common failure modes than "building the right thing with good data inputs but misunderstanding the algos." Probably more common by an order of magnitude or two.

Then again, maybe I'm just working at companies with problems that are amenable to easily-understood algos but have plenty of data-and-product-themed problems.


Great points about issues faced before dealing with models/algos directly. Understanding something about the models/algos can help guide data prep too.


The roles of statistician and data scientist are not substitutes but more like complements. This guy definitely is a data scientist. Here's some ways to tell:

- Works on non-mission-critical components, e.g. he's not doing statistics for when the wing will fall off your airplane, but he can help you figure out business problems more open to interpretation, e.g. subject line open rates.

- His publishing tools favor flair over convention, e.g. Ctrl+F for "latex" has zero results, but he does have D3, C3, Bokeh, surprisingly no Tableau.

- Not sure he even references a single classical statistics package. The vast majority of people publishing in social sciences or "old school" life sciences are using Minitab, JMP, R, or SAS (correct me if I'm wrong, please, it's an outsider's perspective).

This skillset is not inherently "cutting edge!" or deceptively "all talk, no walk". They really are completely different roles that use some of the same tools, formulas, and jargon. To cut to the heart of it: When a company builds a plane and says "I wonder how unlikely it would be for the wing to fall off?" that creates the demand for a statistician. When a company is trying to out-compete others, or maximize profit/charitable-effectiveness, often in a service or a field that is heavily influenced by human psychology, that creates the potential for a data scientist to add value.


I knew I was forgetting packages. I do in fact use Tableau. Will add it. Thanks for the catch!

As for LaTeX, it would have never occurred to me to add it. I have no idea why not, but it didn't. Maybe because it feels more like a chore than a tool. It's like an anti-tool. I mean, I do or did in the recent past use LaTeX, but in more recent years I would farm that out to someone junior to me who hadn't worked with it for long enough to prefer pouring bleach in their ears to being faced with tweaking one more broken LaTeX template.

I probably should include classical stats packages. They really should go in here. But I've been coding since I was a kid and typically eschewed classical stats and math packages because of my perception that they were slow walled-gardens, and that as soon as I had a method figured out in Matlab or SPSS I'd end up rewriting it in C, C++, or Java to make it work with other things or at scale. That was hammered home in the first company I worked with where we did modeling in SAS and then rewrote every model in Java because SAS couldn't keep up.

I'm not suggesting that classical stats packages aren't data scientists' tools. I think they are. They're just not my tools because of the curious niche I found myself in.


I think my job is similar to yours. My background is in engineering at an industrial manufacturing plant.

I have some of the same issues. The engineers here tend to reach for spreadsheets first (or Access databases, these things are everywhere at my work) and inevitably they run into scaling problems and end up with a huge bloated mess. I step in to re-architect these monstrosities (using "real" databases when necessary).

The other big part of my day to day work is modelling and data analysis. Usually regression based stuff and LP optimization problems (SAS is very good for this), especially around yield and quality control. The venerable Excel "Solver" plugin is often abused very heavily by engineers and is not always the ideal solution.

The person I took over from was a stats guy and the original job title was "Process Statistician". My boss has since retitled my role "Data Management Engineer". I still think of myself as an engineer first and foremost and a "data" person second.

I use SAS heavily. We have kind of gone in the opposite direction to you. I have rewritten some of our models from C++ into SAS, mostly for ease of maintenance, because SAS is better understood by the non-programmers (most of the engineers here do not have a programming/CS background, and those who do tend to know either Fortran or Visual Basic; very few grasp C/C++ well). Speed is not really an issue, but opaqueness and ease of maintenance are.

I'd like to learn R because I have heard it is very similar to SAS but more transferable to outside companies. Julia is the other language I've got my eye on; I have heard it is somewhat similar to MATLAB, which is used for some modelling work here.


sometimes i write python packages to auto populate tex files. like imagine running LDA with 50 topics and showing how each topic (via word cloud) correlates to an outcome variable

then it starts to become a tool :)
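
A rough sketch of the "auto populate tex files" idea, assuming the simplest possible output: a two-column LaTeX table of topics and their correlation with an outcome. The function name, topic labels, and correlation values are placeholders I made up; a real version would take them from the fitted LDA model.

```python
def topics_to_latex(rows):
    """Render (topic label, correlation) pairs as a LaTeX tabular block."""
    lines = [
        r"\begin{tabular}{lr}",
        r"Topic & Correlation \\",
        r"\hline",
    ]
    for label, corr in rows:
        lines.append(rf"{label} & {corr:.2f} \\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


# Placeholder values for illustration only, not real model output.
table = topics_to_latex([("Topic 0", 0.5), ("Topic 1", -0.25)])
print(table)
```

Labels containing characters LaTeX treats specially (such as `_` or `%`) would need escaping first; this sketch does not handle that.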


Cassandra is mentioned. I agree it's great for storing metadata and can be used to build efficient graph implementations, but it's cited for Graphs and Relationships? I think that can be misleading, as Cassandra is a distributed column-based key-value store.


I noticed that too. I don't want to gainsay the author's experiences, but it sounds like the author is describing the job of a data analyst who happens to dabble with various software. I don't get the sense the author has in-depth knowledge about the tools he lists.

Also, I don't know about putting Mongo and Cassandra under "Tools for working with unusual datasets".


Am I the only one who came here looking for someone's experience as a tool set? For a second there I thought I might have stumbled over real honesty, a rare treat these days. Maybe, if we stop putting each other in stupid labeled boxes to please our bullshit peddling masters, we would get somewhere...


From the article:

> Machine learning and data mining are not well distinguished, but machine learning techniques increasingly favor “unsupervised” learning algorithms.

The statement above puzzles me because it does not align with what I can see in the news. Maybe I'm just uninformed, so please let me know if I'm wrong.

According to what I can read in the news:

1 - Almost all of the recent ML developments that I can think of are in the field of supervised learning / reinforcement learning

2 - the only field that I can think of where unsupervised learning techniques are prevalent is data mining, which is precisely why I see it as a very specific field.

Am I missing something?


No, you're right. Nothing about this blog post/resume inspires confidence.


Big Data is when you outgrow Excel.


'Data scientist' is just title inflation by statisticians.


Some say [0] it's title deflation for statisticians.

[0] http://bactra.org/weblog/925.html


"Statisticians" taught everyone NHST, and relegated bayesian probability to the appendix for decades. Once you realize what has happened there, you will view that title with very little respect.

I am glad to see machine learning, ai, "data science", whatever, grow as a separate field. The statistics programs had their chance.


There are cases where this may be the case, but did you look at the tools in the blog post? Can statisticians be expected to write mongoDB code, create a web scraper, and make interactive visualizations in D3?

Title inflation exists, but there is a real-world role here that isn't really captured by "statistician" at all.


If you're in a statistics program you're going to learn to code. That's been my experience anyway.


I think it's great that students and young professors in the sciences are taught to code now. I've even taught some of them.

To me, data science is more than understanding statistics; it's been essential to know how to scale them up and out.

If you're a domain scientist, you won't necessarily learn how to write reusable tools that are performant (or runnable) on data that is different from your initial model data. I once worked with a group whose model had grown so unwieldy that their config file was in NetCDF.

I found my niche was often in doing things that were slightly (or completely) outside the comfort zone of most domain scientists who were competent coders themselves, but who didn't have the funded time nor the inclination to learn things like database, visualization, and networking technologies that became necessary either to share their work with other research groups or to operate on larger datasets.

One project had me take a big model that was normally run twice a day and on a 4km grid and help write something that could run and visualize the results of the same thing on a 0.5km grid over a larger area and hourly. And then devise something that could help them visually explore the timeseries as it evolved, sometimes over months.

Designing the pipeline that can handle that is outside the scope of most scientists, even the ones who are good coders.


That line you're talking about sounds more like the traditional science/engineering divide. Maybe statisticians are data scientists, but what we call "data science" is really data engineering?


Actually, I’ve noticed a meaningful distinction between people who learned statistics from machine learning (and are more likely to call each other data scientist) and statisticians (the least experimental of whom used to go by the title analyst): what to do when there is either too little, or too noisy data. Interestingly, those two are happy to be called data scientist, but in my experience, they rarely meet.

A traditionally trained statistician would declare a negative result, decide not to use the model, and argue for keeping the pre-existing approach. A machine learning expert might not care and apply the coefficients out of the model as is, because they are presumably closer than a guess, and is more likely to be openly skeptical of human expertise.

That has led to some frustrating situations for me: I would argue we should censor things like negative speeds, and be told that there was no problem because the results were regularised anyway. Building and picking proper factors to use in regression is something that you can partially get away with skipping when you have larger databases and back-propagation can take over; before that point, insight still matters.

I have not met many who can articulate that transition effectively.

It seems that you’ve met mostly the second category; they are possibly the larger group, but not necessarily the most influential. There is a core of people who are meaningfully different. The linked article seems to be from someone in between but closer to the second group.
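
To illustrate the negative-speeds argument, a tiny sketch with made-up readings. "Censor" is read here as simply dropping physically impossible values, which is my interpretation of the comment; the function names are hypothetical.

```python
def censor_negative_speeds(speeds):
    """Drop readings that are physically impossible (speed below zero)."""
    return [s for s in speeds if s >= 0]


def mean(values):
    return sum(values) / len(values)


raw = [12.0, 15.0, -40.0, 13.0]  # made-up readings; -40.0 stands in for a sensor glitch
clean = censor_negative_speeds(raw)
print(mean(raw), mean(clean))
```

A single impossible reading drags the raw average down to zero here. Regularisation shrinks coefficients, but it does not know that a speed cannot be negative, so the domain check still has to come from a person.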


More like 'analyst' in how easily it is thrown around. Calling a built in function in python or R is just about equivalent to calling one in Excel. Sure, you can claim that folks need to know more about what is going on, but honestly, how many have actually gone through the work of deriving the functions they're calling to begin with?


I'm wondering how useful deriving functions yourself is in the age of computers. I feel like knowing axioms about the mathematical structure you're dealing with and how to do proofs is very important, but it always struck me as odd that we're still stepping through complex applied maths functions manually in pen and paper. Programmers don't bother, say, writing our own hashtable implementation more than a handful of times in our lives, do we? Does forgetting how to derive hashtables mean we won't know how to use them effectively?

Genuine question - more than happy to be proven wrong.


>stepping through complex applied maths functions manually in pen and paper.

We do that because:

A. It helps us understand them better.

B. It teaches us how to think, the way Feynman said: "Know how to solve every problem that has been solved". Granted, it seems pointless to work through what is easily accessible by machine, BUT it teaches how to solve new problems. I wouldn't consider using NumPy or Matlab as the first step towards solving a new math problem.

It's like using Assembly vs using a higher level programming language.


Completely agree. There's a lot of nuance in these algorithms, they're not as cut and dried as simply calling a package method and oftentimes they aren't optimized to your use case. I work in Machine Learning, specifically on NLP, and it is really obvious when interviewing potential employees who knows what SVD means and who just knows the NumPy function. Most "data scientists" I've interviewed fall in the latter category.

edit-This is of course completely anecdotal experience.


I suppose my real question is - how many times do we need to do it? Once we have stepped through it by pen and paper once, or derived the result, how many times do we need to keep doing it? My experience is that mathematicians will do this again and again and again.


I agree. A smart data scientist doesn't waste their time reinventing the wheel: they build off the hard work of others. When necessary they can create what is needed, but they don't do so typically.

They are both more and less, in my experience, than statisticians (more flexible and solution-oriented, less rigorous and classical), than analysts (they can do more, in general, but a great analyst will be better at analysing and visualizing), than developers (they know more stats, less software engineering, and have great patience for wrestling data into submission). I like to think of data scientists as people who combine the skills of all the above to solve hard problems which exceed the domain of any one specialty (analyst, statistician, developer). It doesn't mean we're amazing at everything, just that we are effective, flexible problem solvers.

And for the record, machine learning, statistical modeling, and data mining are just a small portion of the pie. Being good at modeling and machine learning will not remotely guarantee success as a data scientist.


I respectfully disagree. While I understand where you're coming from, I don't agree with your distinction between an analyst and a scientist. Given the data scientist's typical compensation and expected experience, there should be a higher bar set for them that does include developing solutions from base. I understand the use of utilities, but far too frequently I find people who rely on packages to do their work don't really understand what they're working on (they often don't realize the underlying assumptions that the package writers made for them either). With your description of the tasks for a data scientist, I would label this as a Data Analyst's work if I was hiring one.

I could of course be wrong and have a bit too narrow of a view from my particular subfield.


>how many have actually gone through the work of deriving the functions they're calling to begin with?

Why would you waste your time re-inventing a wheel?

A good data scientist isn't good because he/she can ace shitty trivia, he/she is good because they know the right question to ask.


That's only part of it. A good data scientist is also good because they know how to answer hard questions.

In those situations math isn't "shitty trivia," but instead a tool to be leveraged against those hard questions.

You can consider the derivation of SVD to be shitty trivia while throwing np.linalg.svd around while engineering features. That's fine! Good luck visualizing that data in a meaningful way, or dealing with non-linear data, if you're ignoring that "shitty trivia."


> dealing with non-linear data

What is non-linear data?


Data derived from non-linear inputs.

That is to say, problems that can't be expressed by linear functions.

E.g. y = mx + b is a linear function.

y = ax^2 + bx + c is a polynomial (non-linear) function.

Linear Programming (LP) involves optimizing a linear objective subject to linear constraints (something like Excel's Solver can do this).

When you are dealing with non-linear functions you need to use a method such as Sequential Quadratic Programming (SQP).
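
A quick way to see the linear versus non-linear distinction in numbers, sketched under the assumption of evenly spaced x values: the second differences of a linear function are all zero, while a quadratic's are a non-zero constant (2a). The helper name and the example functions are mine, chosen for illustration.

```python
def second_differences(ys):
    """Second finite differences of values sampled at evenly spaced x."""
    first = [b - a for a, b in zip(ys, ys[1:])]
    return [b - a for a, b in zip(first, first[1:])]


xs = range(5)
linear = [2 * x + 1 for x in xs]      # y = mx + b with m = 2, b = 1
quadratic = [x * x for x in xs]       # y = ax^2 with a = 1

print(second_differences(linear))     # all zeros: no curvature
print(second_differences(quadratic))  # constant 2a: curvature a line cannot capture
```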


Using a term like nonlinear science is like referring to the bulk of zoology as the study of non-elephant animals.

— Stanislaw Ulam

https://en.wikipedia.org/wiki/Nonlinear_system


I'm not sure this is what a data scientist is. It was supposed to be a research scientist (which is where the scientist part came from) that wrangles data and code. This individual should have both domain knowledge and coding chops while knowing how to conduct research.


That would make me a data scientist, but I do not think I am and still have to learn a few tricks from this guy (and others).



