In all seriousness, why can't data science simply be about applying the scientific method in the realm of data analysis? It doesn't need to be conflated with machine learning, BI, SQL, etc. It can just be about approaching data analysis with scientific rigor.
My opinion is that the term data science evolved when we started needing cross-functional people who are a blend of:
- domain experts;
- numerical/quantitative specialists (such as statisticians, mathematicians, physicists, STEM people);
- business analysts, business intelligence; and,
- those who traditionally deal with data management, platforms and tools.
That confluence of people was needed amidst the related trends:
- increased government funding for STEM education and brain research;
- marketing from companies such as IBM ("Watson"), the democratization of data and increase in the use of data in daily life;
- the big data wave, subsequent interest in "internet of things" and "digital transformation";
- renewed interest in machine learning and AI (recurrent neural networks and other breakthroughs);
- and others, of course.
We needed to apply more discipline to data analysis - thus data science was born. A formalizing of what many were already doing, to capture the need and changing paradigm. Or so I like to believe.
Because then someone with a business school education (and zero formal statistical training) wouldn't be able to do it. I joke about waiting for finance's Excel models to be rebranded as AI, as I've already seen a handful of hedge funds rebrand their analysts as data scientists.
Maybe I'm being a curmudgeon. In my book if you can't build the statistical tool you're using, you don't understand it. So if, in Excel, you can fit a regression from "scratch", (i.e. not using any built-in regression functions) and use the built-in functions for convenience, that's fine. If you can't, you're a regular financial analyst. (Nothing wrong with that. I was one once.)
This is important because being able to build it means being able to tweak it. Excel's tools have quirks and make built-in assumptions about your data. If those assumptions don't hold, you should be able to tweak (or change) your approach. Being limited to built-in models removes that flexibility. It also implies you don't know when you're crossing between "my tool works" and "my tool is outputting garbage."
This is absolute pure silliness. So the guy down the street that can't manufacture a new carburetor isn't a mechanic? The doctor that can't create penicillin in his personal office isn't a medical professional?
More often than not, the people that create new systems are not the same people that take these concepts and apply them to practical business problem. It takes many kinds of people to help a company grow.
For work to be "science", it really ought to be transparent and reproducible within your community (eg, team within your firm).
Here is a summary of the story: https://www.nytimes.com/2013/04/19/opinion/krugman-the-excel...
I suppose it depends on what you mean by "can". Much like whiteboard interviewing, I bet I can't sit down today and implement a regression from scratch, because I'm out of practice. I have had no need to retain that information; there are tools that do it for me.
Could I do it quickly with a stats 101 textbook in my lap (or 10 minutes of Googling)? Absolutely.
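For the record, the "from scratch" bar is low here. A minimal sketch of simple least squares using the closed-form estimates, in pure Python (the equivalent of doing it in Excel without the built-in regression functions):

```python
# Simple linear regression "from scratch": closed-form least-squares
# estimates, no statistics library required.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Sanity check on data that lies exactly on y = 2x + 1:
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

That's the whole trick for the one-variable case; the multivariate version is the same idea with matrices.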
To be clear, my original comment was not criticizing Excel. It pokes fun at analysts who couldn't tell you a Type I error from a Type II being branded as "data analysts" because they build DCF models.
Certainly it has good features to create what you want, and you could edit the source (if you can read it!)... but it's used by people who have an elementary knowledge of web development.
That's what a statistician does.
I've seen these ML and data science people. And the majority of the time, how they tackle data is radically different from statisticians; it's more of an art than a science compared to what statisticians do.
But this could be my biased opinion, drawn from a small sample of personal experiences.
Actually, on my last day of internship I met a few statistician interns, some of them from Cal (UC Berkeley), and they came to the same conclusion (we have a lot of complaints). The ML/DS group is really just doing black magic (nicest way of putting it). I wish statistics were better at marketing. Oh well.
Run that experiment for me next time you meet a statistician:
- ask him if he can apply Chi-squared to a decision problem
- ask him if he can *explain* how and why Chi-squared works.
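The second question is the one that matters, and the "why" is small enough to sketch. A toy decision problem with made-up counts (pure Python, no stats library):

```python
# Chi-squared test of independence on a 2x2 table, worked by hand.
# Rows: variant A / variant B; columns: converted / did not convert.
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]        # [40, 60]
col_totals = [sum(col) for col in zip(*observed)]  # [50, 50]
n = sum(row_totals)                                # 100

# Under independence, expected[i][j] = row_total_i * col_total_j / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# The statistic sums squared deviations from those expected counts,
# each scaled by the expected count itself.
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))

# The 5% critical value for df = 1 is about 3.841.
print(round(chi2, 3), chi2 > 3.841)  # 16.667 True
```

If he can walk you through why the expected counts look like that, he passes the second question.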
Learning how to use a screwdriver to screw screws without understanding notions of torque and moment doesn't mean you're applying the scientific method.
If a medical researcher is testing a new radiotherapy treatment, but can't mathematically model every fission problem you can throw at them, they're still applying the scientific method.
I think a PhD statistician can do this.
A master's in statistics does not touch on field and measure theory in statistical inference, so many questions go unanswered. But I suspect you may be correct.
I've viewed chi-squared as a statistical distance for most of my encounters with it.
I sometimes work with people who are formally trained in stats at a much higher level than myself. But with my lesser ability combined with my capability to write production grade Python code using the pydata stack, my code and solutions are more likely to make it in.
I'm coming at it from just a modeler role, and it seems like you are describing more of a full stack within data science. I took the OP's remark as being about data analysis in the narrow sense: analyzing the data, not turning it into some kind of application.
As for Python, time series isn't as good in Python. I do agree that Python is good for production. But I don't believe Python has better analysis packages. I'm in the camp of using programming languages for their strengths and using Apache Thrift or whatever to tie everything together.
Time series isn't as good
Linear regression isn't as good
Mixed models are awful
Good luck fitting a spline without diving into scipy and doing it as an interpolation
R packages are almost always accompanied with a rigorous paper and a great vignette, whereas in Python, there are almost no talks of the implementation and just documentation about how you can use the library.
Every model we develop has assumptions and shortcomings. R offers the tools to diagnose and examine them, whereas in Python I get the feeling that it's more like "some smart guys figured out this formula, here's an implementation of it".
But you're right. I find most Springer, CRC, etc. books are in R with accompanying R packages. Shumway & Stoffer's time series book comes with the R astsa package. Andrew Gelman's Bayesian group created Stan, and their target was R first (rstan). The Stanford people who created the lasso and elastic net wrote glmnet. The thought that the creators of, or experts on, these subjects wrote a book or research paper and then also published an R package to accompany it is reassuring.
When it comes to just data analysis, it seems like R is as good as or better than Python. I think Python is much better for ML stuff such as deep learning, and it also seems like Python has better NLP support. Arguably Stanford is very good at NLP, and they publish their tools in Java.
> "... they tackle data is radically different from statistician ..."
The short answer is: before, we had limited data and compute, and since then the problems businesses need solved have increased substantially.
Now we can do better things, and if you can't do more than what companies were doing in the 1950s, then your value proposition is significantly less than that of someone using industry practices established in 2017 / early 2018.
> "I wish statistic is better at marketing. Oh well."
as a data scientist, I have to know stats, ML, big data tools (spark/splunk/hdfs), programming, and domain specific knowledge. some data scientists are just rebranded statisticians, which is fine. the role of a data scientist varies to the needs of the org they are in. I mostly do anomaly detection and classification, while others may focus solely on AB testing.
> "why can't data science simply be about applying the scientific method in the realm of data analysis?"
The statement doesn't make sense to me. If you are doing data analysis, then by default you are doing observation and measurement on empirical data.
As far as experimentation goes, almost no one has the resources or the standing to do that. Some do, like those AB testers I mentioned before.
> "Actually my last day of internship I've met a few statistician interns some of them are from Cal (UCBerkely) and they came to the same conclusion (we have a lot of complaints). The ML/DS group is really just doing black magic (nicest way of putting it)"
Then they don't appreciate ML and the problems it is trying to tackle. It reminds me of a stats PhD co-worker who was upset about the number of parameters in an image classification problem we had. They didn't think a model could be run that had more parameters than observations.
I was like, we are not running linear models... As an experiment, go to Kaggle and try the spooky author classification competition. See how far you can get with basic stats, and then see how much further you can get with ML. Hopefully it will give you an appreciation for some of the ML tools available.
It's not black magic; like many other domains, it's just a dense subject that takes a lot of time to understand.
Even if you did a CS degree and then did a BMath with a statistics major, you still wouldn't have all the skills you would need, though you'd have a great place to start from. There is an art in kinda guiding / interpreting the mathematics that's hard to teach, and most people in the field kinda pick it up through experimentation. I'm sure it will get more formalized at some point, but I don't think the formalization will be all mathematical or scientific; I think much of it will be making explicit how to think about certain domains, or a set of processes that are useful.
Also I think there is a big difference between Data Engineers, Data Analysts, Data Scientists, and AI researchers.
Depending on the field, an AI researcher deals more with idealized forms and is much stronger on the mathematics of machine learning.
Data engineers are usually stronger on the CS topics that deal with scale.
Data analysts, even the very best ones, tend to be weaker on the CS, the math, but tend to be the strongest at quickly understanding data and making it intelligible to decision makers. I'm not knocking it—I was an analyst at one point—it's just the truth that some positions take more skill than others and most great data analysts tend to treat it as a 3 or 5 year stop before levelling up to management or a more technical role.
Data Scientists generally have the skills of a data analyst (though they have trouble dumbing things down at times if they came here from something other than an analyst position) with some of the skills from the other two.
The way some application code software developers dismiss a whole sub-discipline is kinda embarrassing.
But it's also disappointing the way that you belittle all machine learning practitioners, even those with academic credentials, for their work not being worthy of serious consideration. This also sounds a little defensive and projective, and I can't imagine it's easy seeing the forest for the trees with your head in the clouds.
I do want to give a personal view and constructive criticism, but any intentional and direct attack on another domain is not my intention.
I think statistics is very well equipped to do just data analysis. The discipline has many weaknesses, I acknowledge that, but I don't believe data analysis is one of them when it is the core tenet of what statistics is.
I believe data science is too new and is still trying to find its standing. It also seems like a jack-of-all-trades, master-of-none discipline. I don't believe it can master everything, including data analysis, with all the other things it's trying to incorporate.
My critique of the original post is that there is already a field for data analysis. It has existed, just for that, for a century now; it's named statistics.
ML/DS is a new breed that is more than just data analysis. Because of this they can do many things, but I don't believe they are experts at any one thing. Which is fine. But I do understand that this discussion is sensitive for people within DS/ML, since that is how they make money and earn a living.
They tried to fix it with PCA.
"Data science" is the most natural name for this field. Though fields like "information science" and "political science" are broad, "data science," as it is popularly defined, is uniquely narrow. This is problematic because general fields typically serve as a roadmap for all subfields - a cursory glance at what they are and how they relate. "Data science" today does not provide this road map.
It's like saying "Web Designer" nowadays, when in reality we have a variety of job specializations like UX strategist (ux only), graphics designer (photoshop), UI developer (html/css only), front-end developer (html/css/js), etc.
I suppose someone who isn't very smart is staffing data analysts with little understanding of science. Most would assume that if you are getting paid to analyze all this valuable data, you would have some grounding in the scientific method.
There is a crossover: data science is often about constructing data resources from other data resources (and then doing advanced analysis, like counting how many x). Doing this rigorously, efficiently, and with regard to the underlying infrastructure and other users (don't kill production) is a big trick.
Day to day, it's mostly SQL, or worse, Hive queries, which make most things much slower than they should be.
I've had a ton of data science interviews which ask how to reimplement binary search from scratch (which I would never do on the job), but not anything about how to do efficient JOINs and query nesting.
That's also why we have new members of the team learn how to do efficient queries and joins, and spend time upfront structuring their problems.
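For what it's worth, the kind of thing we show them can be sketched with Python's built-in sqlite3 (the table names and data are made up): aggregate in a subquery before joining, so the join touches fewer rows, and index the join key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
cur.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "west"), (2, "east")])
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, 10.0), (1, 15.0), (2, 7.0)])

# Index the join key so lookups aren't full scans.
cur.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

# Aggregate in a subquery *before* joining, so the join sees one row
# per customer instead of one row per order.
rows = cur.execute("""
    SELECT c.region, t.total
    FROM (SELECT customer_id, SUM(amount) AS total
          FROM orders GROUP BY customer_id) AS t
    JOIN customers AS c ON c.id = t.customer_id
    ORDER BY c.region
""").fetchall()
print(rows)  # [('east', 7.0), ('west', 25.0)]
```

On three rows it makes no difference, of course; the same shape of query on a few hundred million order rows is where it starts to matter.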
Could you point me to some materials/texts about how to improve querying efficiency for SQL? If it's oriented for beginners then that would be ideal.
Thanks in advance.
My team focuses on traditional reporting (data in oracle, manipulated with SAS, sometimes statistical modeling with SAS, resulting in reports/dashboards presented with QlikView).
The IT team approached me to ask if we wanted to explore any use cases of Splunk for my team. I’ve looked around a bit and don’t understand this stuff enough to know if there are any. We generally are working with health insurance claims data, and have no problems being able to process the data we need using oracle/sas.
Any ideas spring to mind that might make me want to invest time in looking at this?
Is it really a trope? In my experience, collecting data is easily >80%.
The super-cool ML stuff that attracts people to the field in the first place accounts for little more than a rounding error in how the time is really spent.
I know what I need which is a wide open easy to use (think CEO's personal assistant who has zero background in anything other than basic business processes easy) visual ETL tool.
We have found some decent VERY expensive and VERY user unfriendly tools but nothing cheaper than just hiring a person to custom handle each batch.
Basically...we're stuck with Cognos, et al.
- months: convince management to give access to data source
- weeks: try to find the connection string
- days: clean up the data (mostly converting dates to yyyy-mm-dd) and importing/exporting csv files
- hours: load data in database, write simple SQL query and simple visualisation
- seconds: brief moment of satisfaction
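The "days" step is real, and much of it is as mundane as normalizing date strings. A hypothetical stdlib-only sketch (the format list and function name are made up):

```python
from datetime import datetime

# Formats actually seen in the wild; extend the list as new ones appear.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y", "%d %b %Y"]

def to_iso(raw):
    """Normalize a date string to yyyy-mm-dd, or None if unparseable."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # leave it for a human rather than guess

print(to_iso("31/01/2018"))   # 2018-01-31
print(to_iso("01-31-2018"))   # 2018-01-31
print(to_iso("not a date"))   # None
```

The order of the format list is itself a judgment call: ambiguous strings like "03/04/2018" match whichever format is tried first, which is exactly the kind of quiet assumption the cleanup step has to document.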
It may not be safe to assume the inputs are biased, but, then, it's also not safe merely to assume that they aren't.
What's particularly dangerous is to assume unbiased inputs if the subsequent conclusions (the outputs) then influence future inputs. That's the feedback loop that can amplify biases in the data/conclusions with each cycle.
> must that be anything-ist? Or just a fact?
To answer your question, no, it doesn't have to be anything-ist and is, of course, a fact. However, to characterize it as just a fact ignores that it may well be an anything-ist fact, if it had been based on anything-ist data.
Ultimately, this is just a GIGO problem, a concept from early in "computing" (aka "data processing").
* which I'm substituting for "flawed", since "unbiased" was the OP's term
* for example, by setting interest rates, which have an influence on loan repayment likelihood
So if I were to go and beat up every person named John, and then machine learning takes a crack at it and tells us that people named John are more likely to get injured, we may end up discriminating against Johns without realizing it was just me having a thing against people named John. If this happens in a system and creates a feedback loop, it can lead to a self-fulfilling prophecy when it need not have been one.
Based on a naive conclusion the solution may be to deny health coverage to people named John, but obviously the real solution is to put me in jail.
In medicine, that would often mean a randomized study between treatment A and a placebo.
However, with factors such as race, or sex this is obviously impossible.
Also known as a confounding effect. All of these problems were encountered by statisticians decades ago.
Did you know that if you correct wrongly for a confounder, you actually introduce bias?
As a very modest medical researcher myself, I have become very careful about the conclusions I draw from any model I build. While it is very appealing to draw conclusions about causation, they are very often wrong.
For more information, I think this might interest you: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-...
If you're genuinely curious for an answer, in the United States it's absolutely the case that you are legally
> obligated to throw out the conclusion
when it comes to many of the things you mentioned (e.g., you can't consider race, or often gender and sexual orientation depending on the state, when you evaluate a loan/employment application).
The reason for those laws is that, historically, overtly and openly racist people used the loan application process to discriminate in housing.
Asking to be able to use any accurate statistical model for any application is an inherently political request, because bad actors in the past have attempted to exclude people based upon skin color. We can't make that discrimination legal whenever someone can dream up a plausible mathematical model justifying an ultimately racial intent.
The real point of my post is that a discussion on this topic devolves into a flame war unless everyone agrees that we can't turn p-hacking into a legally justifiable way of allowing racial discrimination in important social processes like housing loan or employment applications.
> an ultimately racial intent.
I reject your assertion that by studying data, data scientists or analysts are somehow complicit in discrimination or any -ism. Their job is to make business decisions based on data, WITHOUT relying on "gut feel" or other human biases. I also reject the expectation that data scientists have an obligation (or the ability) to "fix" whatever biases may be revealed by the data.
Again, you are making assumptions that data scientists are somehow evil, engaging in shady or illegal tactics to promote discrimination, rather than simply doing their job in a straightforward manner. P-hacking is often done by individual researchers looking to get a study published in a scholarly journal (as opposed to months/years of research failing to reach a statistically valid conclusion), rarely by companies looking to make profitable decisions based on what the data is indicating.
of course they do that, that is literally their job.
But to reverse-engineer the biases that may or may not exist in an original data set -- please explain how this should be accomplished, because I don't see how someone could accurately quantify the amount or degree of race/sex/age/religion/nationality-ism without introducing additional "bias" based on that person's own opinion.
> Data does not appear magically in a dataset.
Right, so why isn't the boss, or exec, or department head, or third party who sourced the data responsible for de-biasing the data before even handing it off to the data scientist, so s/he can just do the job of data science-ing, and not political science-ing? You're putting a whole lot of "ethical responsibility" on just one person (ironically, the one least likely to be good at interpreting human emotional tendencies) within a much larger ecosystem.
I never said the analyst was the sole person bearing the responsibility. Just as much responsibility must be expected from those who designed the information model and those who collected the data as from the person who ultimately analyses it and prepares it for whatever kind of dissemination. Everybody involved has to take their individual responsibility so that we achieve collective responsibility.
What I assume is that not all humans are perfectly rational individuals whose only goal is profit maximization. And that we cannot see people's souls, so we need laws that err on the side of caution.
I also assume that homo economicus is explicitly disincentivized from reasoning about feedback effects and historical context, which are two things many actual homo sapiens care about (for obvious reasons).
E.g., disregarding feedback loops and the long arc of history, i.e. in a vacuum, supporting racialized slavery is perfectly rational for non-enslaved people interested in profit maximization. Consider also a completely non-racist loan lender with a vested interest in high property values who knows dark-skinned people lower property values. Even if the data scientist is perfectly unbiased, bias and hatred in the underlying population can result in data-driven, profit-motivated decisions that harm marginal groups. The fact that this really did happen en masse is WHY we have these laws...
Regarding the latter point, you are effectively dismissing all non-consequentialist ethics and associated legal traditions as "gut feelings". I submit that these gut feelings play an important role in human societies made up of irrational people. In fact, they are important even in societies of perfectly rational people who are not super-reasoners with perfect foresight.
Adding profiling features does give more accurate predictions. However, I pitched not using these features as a competitive advantage to the founders and they (luckily) agreed. We won’t be using them, and we ended up (with more work of course) getting a similar performant model without them.
We can and should try to not use those kinds of features and be as fair as possible.
If Group A is low credit risk and Group B is high credit risk, won't your new model still make it much harder for Group B to be approved? If not, it's not "similar." If you are able to dissect the Groups and pull individual high/low risks from within the groups, that would be a superior model, which is not what you claimed. So how are you not profiling, and how does it make a difference in terms of who gets approved/rejected?
Sounds like it wouldn’t make any difference (other than PR).
In the real world, the state of affairs in Data Science is more practical and pragmatic. And there's nothing wrong with that.
"(2) decision science, which is about “taking data and using it to help a company make a decision”; and (3) machine learning, which is about “how can we take data science models and put them continuously into production."
Machine learning isn't about putting models into production. It's about machine learning models directly from data.
And if decision science is 'taking data and using it to help a company make a decision', then pretty much any job involves data science, e.g. the guy comparing quotes for paperclips and picking a vendor.
From a business perspective, the thing that's different about "machine learning" compared to other things you do with data is that it's possible to take the human out of the loop. That's a qualitative difference, as opposed to the quantitative difference of your business analysts giving better recommendations. We can quibble over terms, but as a broad stroke, things that are machine learning can do that and things that aren't machine learning cannot.
That qualitative difference is the main thrust of the quote you pulled, although it could be more explicit. Rather than the analyst building a model that tells him what shade of red is best for a button so that he can pass that information along to a design team, the button color is connected directly to the model.
'From a business perspective, the thing that's different about "machine learning" compared to other things you do with data is that it's possible to take the human out of the loop.'
There are many things you can do with data that take humans out of the loop, that don't involve machine learning. For example, software that automatically re-orders stock in a supermarket once stock (calculated based on starting stock less sales) goes below some level.
You could argue that this still has a human in the loop (to define a threshold) and that you're not removing the human from the loop until the thresholds themselves are automatically calculated.
But then you're just moving the job of the human from deciding the threshold, to deciding what % of the time it's acceptable to be out of stock of that item. Sure, you can automate that, too, but then the job of the human still exists: she's just deciding the objective function that stock-out percentage must satisfy, rather than deciding the stock-out percentage for each SKU directly using a jupyter notebook or Excel sheet.
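To make that concrete, here's a hedged sketch of the last step described above: deriving the reorder point from a chosen stock-out tolerance, assuming roughly normal daily demand (the numbers and function name are made up, stdlib only):

```python
from math import sqrt
from statistics import NormalDist

def reorder_point(mean_daily, sd_daily, lead_time_days, stockout_tolerance):
    """Stock level at which to reorder, given an acceptable stock-out rate.

    Assumes daily demand is independent and roughly normal; the human's
    job shifts to choosing stockout_tolerance and checking that assumption.
    """
    mu = mean_daily * lead_time_days            # expected demand over lead time
    sigma = sd_daily * sqrt(lead_time_days)     # its standard deviation
    service_level = 1 - stockout_tolerance
    return NormalDist(mu, sigma).inv_cdf(service_level)

# e.g. 20 units/day on average (sd 5), 4-day lead time, and we accept
# being out of stock on 5% of replenishment cycles:
rp = reorder_point(20, 5, 4, 0.05)
print(round(rp, 1))  # ~96.4
```

The human decision hasn't disappeared; it has just moved into `stockout_tolerance` and the normality assumption, which is exactly the point being made above.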
OLS is great for many types of problem.
For others, other techniques massively outperform them in some way (e.g. CNNs for classifying camera images or spectrograms of audio data).
Even where OLS performs well, it seems other techniques can frequently do better.
In AI, it's Hinton, LeCun, Bengio.
In systems, it's Dennis Ritchie, Jeff Dean, Berners-Lee, Torvalds.
In distributed systems, it's Lampord, Chandy, J Dean.
In programming languages, it's Dennis Ritchie, Gosling, Dijkstra, Knuth, Milner, etc.
Who are data scientists' heros or role models?
> She was also a pioneer in the graphical presentation of data. At a time when research reports were only beginning to include tables, Nightingale was using bar and pie charts, which were colour coded to highlight key points (eg, high mortality rates under certain conditions). Nightingale was keen not only to get the science right but also to make it comprehensible to lay people, especially the politicians and senior civil servants who made and administered the laws.
Gelman, Tufte and Wickham are in the pantheon, sometimes followed by dataviz fellows like Cairo.
Never heard of Warden.
Silver would likely be described as a 'data journalist' rather than (or at least as often as) a data scientist.
All these are common statistical learning methods used in Data Science.
This becomes especially obvious in the epilogue, where they try to give a quick overview of how the concept of "data science" formed and how statistics diverged into data science + ML on one side and the traditional statistics community on the other.
They end the book by arguing that both communities should find their way to each other, because fundamentally they are trying to do similar things. I also think this is badly needed. Unfortunately I experience some arrogance on both sides, which makes it harder! ("DS/ML people have no idea what they are doing and only throw their algorithms at problems and benchmark them!" "Statisticians are obsolete and I can just automate them with NNs!")
Lamport, in case anyone's googling :)
I mean, it made it sound like a data scientist is just the same as a business analyst? Is this the new computer scientist vs. software engineer?
I think this demonstrates how hard "titles" are, because a "business analyst", in the sense that I learned, is not at all like a data scientist (or data analyst):
"Business Analysis'' is a research discipline of identifying business needs and determining solutions to business problems. Solutions often include a software-systems development component, but may also consist of process improvement, organizational change or strategic planning and policy development."
Most BA work I've done involved translating business requirements into technical or software requirements.
In other words, who knows...
I’m confused, if you know the full stack already don’t you already know all this? It’s all part of the stack after all.
“As a webdev...”
1. Kaggle.com competitions incentivise chasing a metric, where very large amounts of man-hours are used to improve a model a tiny bit, which may not have dramatically different performance in a real-world situation than a baseline model. (Kaggle Kernels, which consist of Python/R Notebooks, are better for teaching how to analyze datasets: https://www.kaggle.com/kernels)
2. Andrew Ng's Coursera course is good, for an introduction to terminology only. You won't be using anything from the course in real-world use cases (e.g. hand-implementing the matrix algebra behind backpropagation), but it's good to learn what backpropagation is. And please don't say you're an expert in machine learning because you took the course.
BUT I think Kaggle still trains you in the right things. In my experience, the winners were almost always more creative and didn't win just through massive hyperparameter grid search. I also think that a lot of the usual non-DS struggles are just not as challenging.
If I were in a position to hire a DS and theoretical skills were not that important (maybe there's already somebody with strong theoretical skills on the team), I would just hire somebody based on his Kaggle profile. It will probably take some time until he adjusts to real-world DS, but I don't think it will be that much of a challenge.
Only after that 90% is done can anybody think about modeling data, transforming it, processing it and lastly that glorious 5% of actually analyzing it.
Oh, and then somebody wants the results of the analysis to be put into a fully interactive scalable web application so now we're late.
Calling it science is a stretch. I can understand if you are solving problems in a traditional scientific field, but if you are doing economic modeling to manage investment risk and optimize profit for an internet company, it's hardly science. What a scam!