Data science, as a line of work, is distinct from other technical roles in its focus on creating business value using machine learning and statistics. This quality is easily observed in the most successful data scientists I've worked with (whether at unicorn startups, big companies like my current employer, or "mission-driven" companies).
Implicit in this definition is avoiding the destruction of business value by misapplying ML/statistics. In that sense, I am concerned about blog posts like these (which list 50 libraries and zero textbooks or papers) and those who comment to dispute the relevance of "real math" in the era of computers.
Speaking bluntly: if you are a "data scientist" that can't derive a posterior distribution or explain the architecture of a neural network in rigorous detail, you're only going to solve easy problems amenable to black-box approaches. This is code for "toss things into pandas and throw sklearn at it". I would look for a separate line of work.
–Medium Stats/ML, medium Engineering ("Data Scientist" or "Data Engineer")
–High Engineering on very large datasets, low/medium Stats/ML ("Data Engineer" or "Backend Engineer")
–High Analysis, medium Stats/ML, low Engineering ("Analyst")
–High traditional Stats, high Analysis, low ML/Engineering ("Statistician")
–High ML, medium Stats, medium Analysis ("Data Scientist")
–High ML, medium Engineering ("Machine Learning Engineer")
And let's not talk about the different CSV dialects you can encounter... or how to get data out of databases that are literally the only reason some "guardian DBAs" still have jobs. It can take months to get any access, if you ever get it...
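To make that concrete, here is a minimal sketch of the kind of defensive loading I mean, assuming a hypothetical messy.csv whose delimiter and quoting aren't known up front:

```python
import csv

import pandas as pd

# Hypothetical file; delimiter, quoting, and encoding are all unknown up front.
path = "messy.csv"

# Let csv.Sniffer guess the dialect from a sample of the raw text.
with open(path, "r", encoding="utf-8", errors="replace") as f:
    dialect = csv.Sniffer().sniff(f.read(64 * 1024))

# Hand the guessed delimiter to pandas; the result is still worth eyeballing.
df = pd.read_csv(path, sep=dialect.delimiter)
print(df.head())
```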
I understand the math behind what I do, but it's not a fair assumption to think that everyone reading my post will be motivated to pick up and understand the math before they start applying the tools.
With tools like scikit-learn and Orange, it's especially easy to misapply ML and statistics, or simply to approach a problem without understanding the tools and come out with something that looks plausible to the untrained eye.
A key reason you should understand your tools, including the math that underlies them, is that you should be able to look at the results of your work and know if something is "off". Beyond that, understanding the underlying math gives you the tools you need to debug.
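To illustrate the kind of plausible-looking mistake I mean (a toy sketch of my own, not something from the article): selecting features on the full dataset before cross-validating leaks information from the held-out folds, and scikit-learn will happily report an inflated score.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

# Wrong: feature selection sees every row, including the "held-out" folds.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Right: selection happens inside each training fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")   # typically well above chance
print(f"honest CV accuracy: {honest:.2f}")  # near 0.5, as it should be
```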
The disadvantage is:
You never know for sure that you are right, and there is extra time spent applying your experience to each new type of problem.
The advantage is:
You can more easily relax assumptions once the model is set up, and learned heuristics let you deal with new problems more quickly than working everything out the "perfect" way.
This is an incredibly important point.
I'm working as a fundraising and marketing analyst for a non-profit, but my background is in biology. The skill-set needed for analysis is pretty similar between marketing and population ecology. If you ask someone in either field what the biggest barrier to analysis is, getting data would almost certainly be the most common answer for both fields. However, data is treated very differently between the two fields.
On the scientific side, I find that most of the frustration occurs because there isn't enough data to make a conclusion. Peers will criticize conclusions made with insufficient information.
On the business side, I find that I'm often pressured to make claims that are far more confident than the data can support. As a scientist, I am always very aware of the limitations of my data, but in business I feel like I'm pressured to make conclusions, and that people are waiting to make decisions based on any information they can get out of me.
I spend more time on my write-ups than I do planning my experiments, collecting data, and performing my analysis combined. In a business setting time "moves faster" and the stakeholders in a project expect results no matter what. In these cases, communicating what the limitations are in a concrete way is really important. Expressing risk in terms of money, or probability in terms of coin-flips makes a pretty substantial difference, and can really help people relate to the information you are presenting.
I tell you this just to help you understand what you describe. But in my observations of failure modes in business, it is rarely because someone ran the wrong analysis, and more often because most people are unwilling to make any changes unless confronted with overwhelming evidence. (And that hurdle always gets higher no matter how much evidence you give.)
That's probably the second most common problem. I'd say 80% of my job is just fighting confirmation bias. So if someone thinks something needs to be changed, they'll take any sign that it should be changed. If someone thinks something should stay the same way, they'd argue with god about it.
I probably propose changes more often than I propose keeping things the same way, if only because testing an idea and gathering information requires making a change somewhere. I have a lot of conversations with people who are pressuring me to make a conclusion that the current way is best as soon as possible, so they can throw a lot of money at their pet project.
I'd say that most of the claims I'm being asked to make with limited evidence would be supporting the status quo, which is in line with your assessment.
Is this related to my comment? I used "age of computers", but close enough. It's really not a fair representation of what I said at all.
I stressed the importance of knowing theorems and deriving proofs - arguably "realer" math than learning an equation by rote. I did some applied maths in undergrad, and in my experience a lot of my time was devoted to solving large and complex equations using fairly mechanical rules, and comparatively little of my time was spent on axioms and proofs. I wonder whether this focus is justified in the age of computers - might we derive the complex formulas just once or twice as an exercise, and not step through them ourselves again and again? Might we focus more on what the computer can't do well for us - rigour and intuition?
It was initially related, yeah, but I realized I had uncharitably read your point. I edited my comment, but not enough. Sorry about that.
To be fair, this point is often raised in these threads as "why do math when computers do it for us?" so the criticism wasn't specifically levied against you.
We agree that repeated derivation when working on a new problem can be useless. It would be silly to work out OLS assumptions from first principles upon any import of sklearn.linear_model! I believe understanding those assumptions, though, or (say) how backpropagation works is important, since (1) it can help you debug issues and (2) explain modifications to the core models (GLMs or LSTMs, in the above examples).
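For example (a toy sketch, not anything from the thread): a linear fit on quadratic data "works" in scikit-learn, but the residual check that the OLS assumptions suggest makes the misspecification obvious.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 0.5 * x**2 + rng.normal(scale=1.0, size=200)   # the true relationship is quadratic

model = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - model.predict(x.reshape(-1, 1))

# Under the OLS assumptions the residuals should look like structureless noise;
# the clear U-shape here is the tell that the linear specification is wrong.
plt.scatter(x, residuals, s=10)
plt.axhline(0, color="k", lw=1)
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```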
The same type of thing seems to happen in other fields, too. Software engineers who don't engineer, data scientists who don't 'science', project managers who don't manage. Are they top in their field? No, they somehow have a job with the title though and so far have managed to not become unemployable. Do they care if they are rigorous in what their title is expected to be by top practitioners? Probably not, they get paid still and have the title, and can probably get hired at the next similar place.
Kind of sad that these positions may 'cheapen' the title, so what can be done about that? Not much I guess, since companies can use position titles as they'd like it seems...
Then again, maybe I'm just working at companies with problems that are amenable to easily-understood algos but have plenty of data-and-product-themed problems.
- Works on non-mission-critical components, e.g. he's not doing statistics for when the wing will fall off your airplane, but he can help you figure out business problems more open to interpretation, e.g. subject line open rates.
- His publishing tools favor flair over convention, e.g. Ctrl+F for "latex" has zero results, but he does have D3, C3, Bokeh, and surprisingly no Tableau.
- Not sure he even references a single classical statistics package. The vast majority of people publishing in social sciences or "old school" life sciences are using Minitab, JMP, R, or SAS (correct me if I'm wrong, please, it's an outsider's perspective).
This skillset is not inherently "cutting edge!" or deceptively "all talk, no walk". They really are completely different roles that use some of the same tools and formulas and jargon. To cut to the heart of it: when a company builds a plane and asks "how unlikely is it that the wing will fall off?", that creates the demand for a statistician. When a company is trying to out-compete others, or maximize profit/charitable-effectiveness, often in a service or a field heavily influenced by human psychology, that creates the potential for a data scientist to add value.
As for LaTeX, it would never have occurred to me to add it. I have no idea why not; it just wouldn't have. Maybe because it feels more like a chore than a tool. It's like an anti-tool. I mean, I do or did in the recent past use LaTeX, but in more recent years I would farm that out to someone junior to me who hadn't worked with it long enough to prefer pouring bleach in their ears to being faced with tweaking one more broken LaTeX template.
I probably should include classical stats packages. They really should go in here. But I've been coding since I was a kid and typically eschewed classical stats and math packages because of my perception that they were slow walled-gardens, and that as soon as I had a method figured out in Matlab or SPSS I'd end up rewriting it in C, C++, or Java to make it work with other things or at scale. That was hammered home in the first company I worked with where we did modeling in SAS and then rewrote every model in Java because SAS couldn't keep up.
I'm not suggesting that classical stats packages aren't data scientists' tools. I think they are. They're just not my tools because of the curious niche I found myself in.
I have some of the same issues. The Engineers here tend to reach for spreadsheets first (or Access databases - these things are everywhere at my work) and inevitably they run into scaling problems and end up with a huge bloated mess. I step in to re-architect these monstrosities (using "real" databases when necessary).
The other big part of my day-to-day work is modelling and data analysis. Usually regression-based stuff and LP optimization problems (SAS is very good for this), especially around yield and quality control. The venerable Excel "Solver" plugin is often abused very heavily by engineers and is not always the ideal solution.
The person I took over from was a Stats guy and the original job title was "Process Statistician"; my boss has since retitled my role "Data Management Engineer". I still think of myself as an engineer first and foremost and a "data" person second.
I use SAS heavily. We have kind of gone in the opposite direction to you. I have rewritten some of our models in the past from C++ into SAS, mostly for ease of maintenance, because SAS is better understood by the non-programmers. (Most of the Engineers here do not have a programming/CS background, and those that do tend to know either Fortran or Visual Basic; very few grasp C/C++ very well.) Speed is not really an issue, but opaqueness and ease of maintenance are.
I'd like to learn R because I have heard it is very similar to SAS but more transferable to outside companies. Julia is the other language I've got my eye on; I have heard it is somewhat similar to MATLAB, which is used for some modelling work here.
then it starts to become a tool :)
Also, I don't know about putting Mongo and Cassandra under "Tools for working with unusual datasets".
> Machine learning and data mining are not well distinguished, but machine learning techniques increasingly favor “unsupervised” learning algorithms.
The statement above puzzles me because it does not align with what I can see in the news. Maybe I'm just uninformed, so please let me know if I'm wrong.
According to what I can read in the news:
1 - Almost all of the recent ML developments that I can think of are in the field of supervised learning / reinforcement learning.
2 - The only field that I can think of where unsupervised learning techniques are prevalent is data mining, which is precisely why I see it as a very specific field.
Am I missing something?
I am glad to see machine learning, ai, "data science", whatever, grow as a separate field. The statistics programs had their chance.
Title inflation exists, but there is a real-world role here that isn't really captured by "statistician" at all.
To me, data science is more than understanding statistics; it's been essential to know how to scale that work up and out.
If you're a domain scientist, you won't necessarily learn how to write reusable tools that are performant (or runnable) on data that is different from your initial model data. I once worked with a group whose model had grown so unwieldy that their config file was in NetCDF.
I found my niche was often in doing things that were slightly (or completely) outside the comfort zone of most domain scientists who were competent coders themselves, but who didn't have the funded time nor the inclination to learn things like database, visualization, and networking technologies that became necessary either to share their work with other research groups or to operate on larger datasets.
One project had me take a big model that was normally run twice a day on a 4km grid and help write something that could run it, and visualize the results, hourly on a 0.5km grid over a larger area. And then devise something that could help them visually explore the timeseries as it evolved, sometimes over months.
Designing the pipeline that can handle that is outside the scope of most scientists, even the ones who are good coders.
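As a rough sketch of what that pipeline work can look like (my own assumed stack, not what that group actually used: xarray + dask over hypothetical NetCDF output with a "temperature" variable and time/x/y dimensions):

```python
import xarray as xr

# Hypothetical hourly 0.5 km model output split across many NetCDF files.
# Chunking through dask keeps memory bounded when the full grid won't fit in RAM.
ds = xr.open_mfdataset("output/*.nc", combine="by_coords", chunks={"time": 24})

# Reduce the full grid to a daily domain-average timeseries for exploration;
# everything stays lazy until .compute() is called.
daily_mean = ds["temperature"].mean(dim=["x", "y"]).resample(time="1D").mean()
daily_mean.compute().plot()
```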
A traditionally trained statistician would invoke the negative result, decide not to use the model, and support maintaining the pre-existing approach. A machine learning expert might not care and would apply the coefficients from the model as-is, because they are presumably closer than a guess, and is more likely to be openly skeptical of human expertise.
That has led to some frustrating situations for me: me arguing we should censor things like negative speeds, while being told there was no problem because the results were regularised anyway. Building and picking proper factors to use in a regression is something you can partially get away with skipping when you have larger databases and back-propagation can take over; before that point, insights still matter.
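To be concrete about the negative-speeds case, a toy sketch with a hypothetical DataFrame: drop (or cap) physically impossible values before the model ever sees them, rather than hoping regularisation papers over them.

```python
import pandas as pd

# Hypothetical sensor readings; negative speeds are physically impossible here.
df = pd.DataFrame({"speed": [3.2, -1.7, 5.0, 0.0, -0.3, 4.1],
                   "load":  [10,   12,   9,   7,   11,   8]})

n_bad = (df["speed"] < 0).sum()
print(f"dropping {n_bad} physically impossible rows")

# Censor before modelling; clipping at 0 is the alternative if the negatives
# are known to be measurement noise around a true value of zero.
clean = df[df["speed"] >= 0].copy()
```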
I have not met many who can articulate that transition effectively.
It seems that you’ve met mostly the second category; they are possibly the larger group, but not necessarily the most influential. There is a core of people who are meaningfully different. The linked article seems to be from someone in between but closer to the second group.
Genuine question - more than happy to be proven wrong.
We do that because:
A) it helps us understand them better
B) it teaches us how to think, the way Feynman said: "Know how to solve every problem that has been solved." Granted, it seems pointless to work through what is easily accessible through a machine, but it teaches us how to solve new problems. I wouldn't consider using NumPy or Matlab as the first step towards solving a new math problem.
It's like using Assembly vs using a higher level programming language.
edit: This is of course completely anecdotal experience.
They are both more and less, in my experience, than statisticians (more flexible and solution-oriented, less rigorous and classical), than analysts (they can do more, in general, but a great analyst will be better at analysing and visualizing), than developers (they know more stats, less software engineering, and have great patience for wrestling data into submission). I like to think of data scientists as people who combine the skills of all the above to solve hard problems which exceed the domain of any one specialty (analyst, statistician, developer). It doesn't mean we're amazing at everything, just that we are effective, flexible problem solvers.
And for the record, machine learning, statistical modeling, and data mining are just a small portion of the pie. Being good at modeling and machine learning will not remotely guarantee success as a data scientist.
I could of course be wrong and have a bit too narrow of a view from my particular subfield.
Why would you waste your time reinventing the wheel?
A good data scientist isn't good because they can ace shitty trivia; they're good because they know the right questions to ask.
In those situations math isn't "shitty trivia," but instead a tool to be leveraged against those hard questions.
You can consider the derivation of SVD to be shitty trivia and just throw np.linalg.svd around while engineering features. That's fine! But good luck visualizing that data in a meaningful way, or dealing with non-linear data, if you're ignoring that "shitty trivia."
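To ground that with a toy example of my own: knowing what the factors returned by np.linalg.svd mean is exactly what tells you how to project data onto its top principal directions for plotting, and why that projection says nothing about non-linear structure.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 20))

# Center, then take the thin SVD: Xc = U @ np.diag(S) @ Vt.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions; projecting onto the first two gives
# the usual 2-D "PCA plot". This only captures linear structure: a curved
# manifold (a spiral, say) can come out looking like a featureless blob.
coords_2d = Xc @ Vt[:2].T
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(f"variance explained by the first two directions: {explained:.1%}")
```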
What is non-linear data?
That is to say, problems that can't be expressed by linear functions.
E.g., y = mx + b is a linear function.
y = ax^2 + bx + c is a polynomial (non-linear) function.
Linear Programming (LP) involves optimizing a linear objective subject to linear constraints (something like Excel's Solver can handle this).
When you are dealing with non-linear functions you need to use a method such as Sequential Quadratic Programming (SQP).
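For a concrete sketch outside Excel (my illustration, using SciPy): linprog handles the linear case, while minimize with the SLSQP method handles a non-linear objective under the same kind of linear constraints.

```python
import numpy as np
from scipy.optimize import linprog, minimize

# Linear programme: minimise -x - 2y subject to x + y <= 4, x <= 3, x, y >= 0.
lp = linprog(c=[-1, -2], A_ub=[[1, 1], [1, 0]], b_ub=[4, 3],
             bounds=[(0, None), (0, None)])
print("LP optimum:", lp.x)

# Non-linear objective, same style of linear constraint, solved with
# sequential quadratic programming (SLSQP): minimise (x-1)^2 + (y-2.5)^2.
cons = [{"type": "ineq", "fun": lambda v: 4 - v[0] - v[1]}]   # x + y <= 4
nlp = minimize(lambda v: (v[0] - 1) ** 2 + (v[1] - 2.5) ** 2,
               x0=np.array([0.0, 0.0]), method="SLSQP",
               bounds=[(0, None), (0, None)], constraints=cons)
print("SQP optimum:", nlp.x)
```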
> Using a term like nonlinear science is like referring to the bulk of zoology as the study of non-elephant animals.
— Stanislaw Ulam