"Data practitioners", "data practice", "data management" just seems like weird rebranding of stuff that businesses have been doing since the 1960s. Partly what the article seems to be relating regarding SQL.
What is "data science" besides computing, data storage of some sort, and analyzing the data? Because it's "BIG" now? Because now we "extract knowledge and insights" from data, since apparently giant databases were amassed for no reason in the past? Because now we "combine multidisciplinary fields like statistics", since apparently nobody had thought to run statistics on their data before? Because "AI"?
I don't think that this leads to changes in the traditional ways of handling IT because of any fundamental change in a data analyst's needs. It's more that, historically they were outside the formal tech sphere, and therefore lusers who had to make do with whatever resources they were able to squeeze out of IT. Now, armed with better technical chops, they're able to make a stronger case for their needs, which means that more is being done to meet those needs.
Which is not to say that your skepticism is unwarranted. "Data science" is still a catch-all term for a new and ill-defined way of working, and things are still pretty wild west. Anyone who tells you they're sure they know what the best practices are is probably trying to sell you their product, which happens to be the best practice. Not many people have the benefit of 20 years of hard-won experience from which to draw a set of really strong guidelines.
Moreover, it's an iterative process: you start with hypotheses about which features will be useful, build a model, and then might need to pull more data. For example, you could build a recommendation engine using just the set of products purchased by customers, but then explore adding user-agent features like device type, or add data about the products like category or price information.
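A minimal sketch of that iteration in pandas (the tables and column names here are made up, just to show the shape of it):

    import pandas as pd

    # Hypothetical raw data; in practice this comes out of your warehouse.
    purchases = pd.DataFrame({
        "customer_id": [1, 1, 2, 3],
        "product_id": [10, 11, 10, 12],
    })
    products = pd.DataFrame({
        "product_id": [10, 11, 12],
        "category": ["books", "games", "books"],
        "price": [12.99, 59.99, 8.50],
    })

    # Iteration 1: model on purchase co-occurrence alone.
    features_v1 = purchases.copy()

    # Iteration 2: hypothesis is that product metadata adds signal, so
    # pull more data and join category/price onto the purchase records.
    features_v2 = features_v1.merge(products, on="product_id", how="left")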
An analyst (“data scientist”) who can pull and interact with that data is much more effective than one who needs to wait for someone else to pull the data for them (most actuaries I know are weak programmers outside of R and wouldn’t know where to begin using distributed computing or getting a TensorFlow gridsearch running on a cluster of cloud GPUs :)
I like some of these technologies for taking care of the dragnet data collection and engineering, but, when I'm actually doing a data analysis, it's rare I want to run it on a Spark cluster; I'd much rather run a Spark job to collect the information I need and sample or digest it down to a size that I can hack on locally in R or Pandas. Yeah, there will be some sampling error, but the potential dollar cost associated with that sampling error is much lower than the cost to eliminate that sampling error. And it's basically zero compared to the elephant in the room: the sampling bias I'm taking on by using big data in the first place. "Big data" is, from a stats perspective, just the word for "census data" that people like to use in California.
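The "digest it down" step might look something like this in PySpark (the path and sampling fraction are hypothetical; I'm only sketching the pattern):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sample-down").getOrCreate()

    # Full dataset lives in distributed storage; the path is made up.
    events = spark.read.parquet("s3://bucket/events/")

    # Filter and sample on the cluster, then pull the small result down
    # to a local pandas DataFrame to hack on in R/Pandas territory.
    local_df = (
        events
        .filter(events.event_type == "purchase")
        .sample(fraction=0.01, seed=42)
        .toPandas()
    )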
Sure, the cloud providers sell computing power, but pricing is a race to the bottom, and renting makes much more sense than buying hardware for these kinds of bursty analytics workloads.
I don't think "big data" from a stats perspective is analogous to census data. In some cases, yes, but for applications like recommendation engines you lose a lot of valuable signal by sampling.
I also way prefer to just crank up the RAM on a single instance and use Pandas/Dask instead of dealing with distributed computing headaches :)
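Something like this, where Dask gives you out-of-core parallelism on one big box without any cluster to babysit (the file glob and column names are made up):

    import dask.dataframe as dd

    # Larger-than-RAM CSVs, processed in parallel on a single machine.
    df = dd.read_csv("logs/2019-*.csv")

    # Lazy task graph; nothing actually runs until .compute().
    spend_per_customer = df.groupby("customer_id")["amount"].sum().compute()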
I do think that most companies log lots of data in different places without much thought or structure, and as a result need a "data scientist", who spends most of their time simply aggregating and cleaning data that can then easily be processed on a single machine. I.e., they spend 10% of their time doing statistics.
Maybe I'm wrong, but the above is how all the data scientists that I know describe their work.
Most of a data scientist's time is definitely spent getting data from various places, possibly enriching it, cleaning it, and converting it into a dataframe that can then be modeled (and surprisingly, a lot of those tasks are often beyond pure actuaries / statisticians, who often get nervous writing SQL queries with joins, never mind Spark jobs)...
But also, once you've got that dataframe, it's often too big to be processed on a single machine in the office, so you need to spin up a VM and move the data around. That requires a lot more computer science / hacking skills than you typically learn in a stats university degree, so I think the term "data scientist" for someone who can do the computer-sciency stuff in addition to the stats has its place...
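As a concrete example of that glue work, pulling joined data into a dataframe and cleaning it might look like this (the connection string, tables, and columns are all hypothetical):

    import numpy as np
    import pandas as pd
    import sqlalchemy

    # Hypothetical warehouse connection.
    engine = sqlalchemy.create_engine("postgresql://user:pass@warehouse/db")

    # The join-heavy SQL that reportedly makes pure statisticians nervous.
    query = """
        SELECT c.customer_id, c.segment, p.product_id, p.price
        FROM customers c
        JOIN purchases pu ON pu.customer_id = c.customer_id
        JOIN products p ON p.product_id = pu.product_id
    """
    df = pd.read_sql(query, engine)

    # Typical enrichment/cleaning before any modeling happens.
    df = df.dropna(subset=["price"])
    df["log_price"] = np.log(df["price"])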
I wrote a full blog post ranting about the disparity between expectation vs. reality for data scientists a year ago: https://minimaxir.com/2018/10/data-science-protips/
Is it no longer the case that every programmer has to know SQL? I have always and still do consider knowledge of SQL a minimum requisite for any programming job...
Programming is now such a big field that plenty of people will go their whole careers without knowledge that's foundational for other programmers - whether it be SQL, linear algebra, memory management, distributed systems, ...
In a way, it's very cool. Instead of a single-dimensional "how good a programmer are you", we've got a multidimensional space. That, naturally, yields a lot more interesting corners, like, say, what could be invented by someone who knows a lot about databases and GPUs.
For a lot of systems, the engineers/programmers care very little about the "data", and they very often don't need to. E.g., at a high abstraction level your webserver doesn't care whether it's sending HTML, JSON, or whatever, and that's just the format, not even the higher-level semantics of the data.
I think there can be a lot of value in data-focused roles in companies, because there are many (hard) questions around data that are often badly answered as afterthoughts:
- What are the exact semantics of this data?
- How do we make sure the data is clean? (see the sketch after this list)
- How do we evolve the data over time?
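On the "clean" question, even a minimal automated sanity check catches a lot; a sketch with hypothetical rules and column names:

    import pandas as pd

    def check_orders(df: pd.DataFrame) -> list:
        """Return a list of data-quality problems found (made-up rules)."""
        problems = []
        if df["order_id"].duplicated().any():
            problems.append("duplicate order_id values")
        if (df["amount"] < 0).any():
            problems.append("negative order amounts")
        if df["customer_id"].isna().any():
            problems.append("orders with no customer attached")
        return problems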
Doesn't the job title of "data scientist" still correlate with high pay, or has the hype started to wane?
Sure, well-thought-out data structures were an integral part of good, clean code, but the bulk of data was an artifact of software in the form of logs and largely unused database records, not the "center" of anything. Today, there are many people who take these logs and database records and run very specialized code on them. I entered industry in the mid 2000s and it was very difficult to find jobs that concentrated on this component back then. We're often called data scientists today, and there's orders of magnitude more attention that goes into it. This is a real shift that has brought data into the "center" of many organizations.
10 years ago they had a big initiative around “Analytics.” I’ve programmed and worked in data for a few decades so I tried to figure out why these biz people were interested in Analytics.
When I asked, they brought up all the terms you mentioned. I showed them examples from various projects of how they were just describing basic data programming. But they disagreed, because this was new.
I figured it out a few years later when they spent $5B on the initiative and it was making money like gangbusters. This was now a product line. It was new that they were selling it.
So now there are people who want more money and need to differentiate, hence the titles and whatnot. I wish there were an easier way to view people's work output, because right now the only way I can differentiate between "data engineers", "data scientists", and "AI wizards" is by looking at their GitHub profiles or talking to them in depth; profiles are almost never complete or accurate, and talking takes a lot of time.
In my experience, most of the people doing what I consider data science have titles like "applied scientist", "machine learning engineer", or "data engineer".
After a new round of hiring, we came to the conclusion that "pure" computer scientists will not make the cut, even those with a PhD, simply because we are quite convinced that they won't be able to understand the problem at hand.
This is a good summary of its meaning and history: https://en.wikibooks.org/wiki/Data_Science:_An_Introduction/...
Programmer -> Developer -> Software Engineer
Sysadmin -> Operations Engineer -> Site Reliability Engineer
QA Engineer -> Automation Engineer -> SDET
A C++ program with a couple megs on the heap is appropriate, just as an ETL pipeline is appropriate for gigabytes to terabytes.