The accurate title is Data Analyst, something that has already existed for a long time and works just fine.
So, data developer?
I'd actually find it weirder if someone defined themselves as a "Computer Scientist" outside of academia, even with a CS degree :)
I think the creation of new data is the whole point, actually. Even at the lowest level, aggregation of data is still new data. Deriving insight by the use of more sophisticated methods is also creating new data. Arguably, all data is derived so just because a data engineer distills large data sets into smaller ones doesn't mean they don't create new data.
> I think the creation of new data is the whole point, actually.
I had the same hesitation when I heard "data developer" in my head. But I agree with JoBrad. If you have a pile of dirt, and compress some of it, then you have a rock where there was none. If you take a rock and "disassemble" it, you have nuggets of gold where there was none.
Really, titles don't matter too much (which is why I used my own :), and we know generally what's being discussed when we hear data science and data scientist.
Considering that all scientists work with data, often very large and complex datasets, the term comes off like a joke.
I prefer being called a Software Developer over Software Engineer or Programmer. For a long time, my work has been more about developing solutions to capture, parse, enrich, and visualize huge volumes of data. Data Developer sounds weird.
What do you guys feel?
> Data science for people (Type A), i.e. analytics to support evidence-based decision making
> Data science for software (Type B), for example: recommender systems as we see in Netflix and Spotify
Isn't "type A" business intelligence, and isn't "type B" machine learning? Why doesn't he use those more widely known terms? Or maybe he is referring to something else?
My job basically aligns with Type A above. I mostly work on optimizing our industrial process through a combination of modelling and simulation work. Other than that, I do quality/defect investigations when we have issues with defective batches etc., and I do yield optimization work. I also oversee various plant trials as required.
I use Business Intelligence software (like SAS and COGNOS) but I use other tools as well (including my own C/C++ code). I lean heavily on my own theoretical knowledge - in particular metallurgy and minerals processing. (I am a Materials Engineer by qualification). I think most BI people would lack the background theory to do my role.
My job title was more or less arbitrarily chosen by my manager (other people in my team have 'Automation Engineer', 'Process Engineer', etc. as their role titles). I consider myself an Engineer above anything else.
The people who do the work you mention (schemas etc) are called Database Admins here. They work for Information Services - different department I'm not all that familiar with.
"Machine learning" includes plenty of activities that can be used to provide evidence for one-off human decision making (e.g., using a model to produce forecasts or to understand sensitivities).
Helping others wrangle data is one of the reasons I publish my Jupyter notebooks open-sourced. A few examples of my data wrangling with R:
Processing Stack Overflow Developer data: https://github.com/minimaxir/stack-overflow-survey/blob/mast...
Identifying related Reddit Subreddits: https://github.com/minimaxir/subreddit-related/blob/master/f...
Determining the correlation between the gender of movies' lead actors and box office revenue: https://github.com/minimaxir/movie-gender/blob/master/movie_...
I'd add that Kaggle is very good for the "other end" of data science: they generally have pretty clean data, and clear problem descriptions.
In real life the data is never clean and the problems are rarely known in advance.
The best I can say is to go to grad school. That's a terrible answer, but it's perhaps the only realistic one. It's in that situation, or one very similar, where you're exposed to loads of criticism and discussion. Basically any paper that was competently written (even if it wasn't competent work) is going to sound convincing to the naive. After hearing a few papers get torn down, you'll see the cracks in weak arguments, the poorly supported conclusions, and the seemingly boring stuff that's absolutely brilliant.
Very generally, the best sign of good work is a 'masochistic' author. What I mean by this is an author that writes as though every result they get is deeply suspect and needs to be corroborated in multiple ways. When it's almost exhausting to read because it feels like they're just beating themselves up, you're probably reading something really special.
Likely the most difficult thing to do as an 'outsider' is to get a sense of how 'trustworthy' certain results are. Some methods are almost binary in that you either get no result, or a great result. If an author shows this, there might be very little reason to doubt it, and thus independent lines of evidence are not really necessary, especially if there's context that supports / is consistent with that result. Other methods are notoriously terrible and need a great deal of careful controls and analysis to even be considered, and then only as one angle of attack. Sometimes you can find reviews that discuss methods like this, which would be an invaluable resource. Reviews are generally a great way to start reading a field, anyhow.
As a practical guide, a well written paper can just be read start to finish. Then reflect on it to see if you understand it. Could you explain the paper to someone else? That's a good sign of whether you understand it. After that, think of critiques. Could the results be interpreted in different ways? Was the analysis appropriate for the data? Are the methods reliable? All papers have weaknesses; we live in a world of finite time and resources. All papers could be better, so think about what could be done. After that, consider what would be reasonable to do. Did the authors skip something conspicuous? That's a good sign that there was some difficulty they were avoiding. That might be fine, but it also might mean there's data that doesn't fit with their conclusions, which would be a very big issue indeed.
That latter part is the most important, but also the most difficult to do. It requires reading dozens, really hundreds, of papers so that you learn about some 'unknown unknowns'. Hearing talks really helps with this, too, as many people will give a sort of history of their work that includes some of the twists and dead ends.
That all said, anyone can read a paper. It's not 'magic' that lets you do it. You'll miss some of the nuance, and occasionally be led astray, but peer review works reasonably well enough that papers are mostly quite good, with the devil in the details. Like most things, it likely follows the Pareto Principle, with a little effort bearing outsize results.
In this case, not only is ALL CAPS utilized, it hits the No True Scotsman fallacy.
I know that some meteorologists have used normality assumptions in forecasting. This is an example of why you cannot just become a data SCIENTIST. Another example: applying regression to your data. If you think that regression is as simple as its formula, then you need at least 4 years to understand what I mean.
There are a lot of people with a rigorous mathematical background (mathematicians, physicists, biologists, computer scientists, ...) who are perfectly capable of understanding and applying stats concepts at a high level.
In addition, these people have a lot of experience with doing scientific research, so shouldn't they be even more qualified to call themselves "data scientists"?
Can you give an example of something that clearly distinguishes a "data scientist" from say a physicist who learned regression from a stats textbook?
For example you can learn regression from a stats textbook but unless you've gone through a thorough (and painful) graduate-level stats course, you probably haven't seen the edge cases that invalidate assumptions and necessitate a more complex regression e.g. your regression may suggest there is no effect but when you look at the residuals, you may find systematic bias that you can model using a subject-specific random effect or some transformation as a generalized linear model...
That isn't to say you need a graduate level stats degree but applying statistics without understanding the pitfalls can lead to seriously wrong conclusions.
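A toy sketch of that residual point (all data invented, and the pooled fit done with NumPy; the per-subject offset stands in for the random effect the naive model misses):

```python
# Hypothetical sketch: a pooled linear fit can report a near-zero slope
# ("no effect") while residuals grouped by subject reveal systematic bias
# that a subject-specific random effect would capture. All data is invented.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_obs = 5, 20
x = rng.uniform(0, 10, size=(n_subjects, n_obs))
subject_offset = np.array([[-10.0], [-5.0], [0.0], [5.0], [10.0]])  # hidden effect
y = 0.0 * x + subject_offset + rng.normal(0, 1, size=x.shape)  # true slope is 0

# Pooled ordinary least squares ignores the grouping entirely
slope, intercept = np.polyfit(x.ravel(), y.ravel(), 1)
residuals = y.ravel() - (slope * x.ravel() + intercept)

# Mean residual per subject is far from zero: the model misses group structure
per_subject_bias = residuals.reshape(n_subjects, n_obs).mean(axis=1)
print(slope)             # small, as if there were nothing to find
print(per_subject_bias)  # large systematic offsets, one per subject
```

The overall residual mean is zero by construction (that's what OLS guarantees), which is exactly why you have to slice the residuals by subject to see the problem.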
I mean, I get it. You would like the word to retain some pure meaning that it never actually had. It's similar to getting upset that people use the word "literally" in a very figurative sense.
The relevant question on this style of article is not about word smithing.
That said, I'm in a field where we often through the title engineer on people, but we don't know why.
To me, a scientist is someone who engages in research with generalizable results.
That would exclude someone who does experiments and analysis, but only applies established methods to a particular problem. Call them an analyst, perhaps, but science is an ongoing dialog that they are not participating in.
I think that's part of why many consider it really presumptuous for 'data scientists' to call themselves such. Some of them are certainly developing new methods and engaging in a kind of dialog that is definitely science. Others are addressing business needs with a new sort of analysis, drawing on the field but not giving back to it. There is absolutely nothing wrong with that, but it does seem like that pursuit is different enough to bother calling it something else.
More recently, a lot of scientists worked to figure out atomic energy. Most, in all likelihood, we do not know the names of anymore.
I think of it as musicians. If you define musician to mean rockstar, there are not nearly as many as if you include school teachers, symphony players, conductors, etc. Yet, for most people, this latter set would definitely be considered musicians.
Now... I deserve to be literally roasted for some of the other mistakes above. I should know throw versus through... Not even pronounced the same...
Or, then I can become a doctor right? By studying hard at home?
You should not employ a self-taught software engineer or statistician or doctor in critical positions such as health, finance, engineering, and science. Some might be okay, but 99.9% of them will be incompetent for the work.
See, here's the deal. Numerical literacy (numeracy) comes about through many processes, one of which is obtaining a statistics degree.
However, not many fresh statisticians could tell you or me the expected runtime of taking a numerical field's median on a large database. Why? Because the backgrounds are different. Yet, for a solid data science team, you'd like at least someone who can give insight as to whether your project will take minutes or decades to complete. This (contrived but important) example suggests that while a BS in statistics is an important and necessary component of a data science team, it is not sufficient.
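To make the runtime point concrete, here is a small sketch (hypothetical code, not what any particular database actually does): quickselect finds a median in expected O(n), whereas fully sorting costs O(n log n), and a real engine may need a multi-pass external algorithm when the field doesn't fit in memory.

```python
# Expected O(n) median via randomized quickselect, for contrast with the
# O(n log n) sort-everything approach. Illustration only, not production code.
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-indexed) in expected O(n) time."""
    pivot = random.choice(values)
    lows   = [v for v in values if v < pivot]
    pivots = [v for v in values if v == pivot]
    highs  = [v for v in values if v > pivot]
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return quickselect(highs, k - len(lows) - len(pivots))

def median(values):
    # Upper median for even-length input, to keep the sketch simple
    return quickselect(values, len(values) // 2)

print(median([7, 1, 5, 3, 9, 4, 8]))  # -> 5
```

Knowing which of these regimes you're in, before the job is launched, is exactly the minutes-versus-decades insight the example asks for.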
95% of undergraduate statistics education is focused on formal inference. Data science, in my experience, involves a lot more exploratory data analysis than formal inference (frequentist or Bayesian).
The extreme focus on inference and the hypothesis testing step in the scientific method is something people with a formal statistics education have to overcome to be productive data scientists. Or applied statisticians, really! It is more important to understand the data, organize it creatively, and find unexpected structure.
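A toy illustration of why exploring and organizing the data matters (invented numbers, a Simpson's-paradox-style pattern): a single pooled statistic suggests one trend, while grouping the same points reveals the opposite.

```python
# Pooled correlation vs. within-group correlation on invented data:
# the pooled number looks positive, but every group trends negative.
import statistics

groups = {
    "A": [(1, 2), (2, 1)],  # hypothetical group, low x and low y
    "B": [(4, 5), (5, 4)],  # hypothetical group, high x and high y
}

def pearson(pairs):
    xs, ys = zip(*pairs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

pooled = [p for pts in groups.values() for p in pts]
print(pearson(pooled))       # 0.8  -> pooled trend looks positive
print(pearson(groups["A"]))  # -1.0 -> but within each group it's negative
print(pearson(groups["B"]))  # -1.0
```

No hypothesis test on the pooled data would have flagged this; only looking at the structure of the data does.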
Existing institutions do not have a monopoly on who is a scientist.