Moreover, it’s an iterative process: you start with hypotheses about which features will be useful, build a model, and may need to pull more data. For example, you could build a recommendation engine using just the set of products purchased by customers, but then explore adding user-agent features like device type, or adding data about the products like category or price.
An analyst (“data scientist”) who can pull and interact with that data is much more effective than one who has to wait for someone else to pull it for them. (Most actuaries I know are weak programmers outside of R and wouldn’t know where to begin with distributed computing or getting a TensorFlow grid search running on a cluster of cloud GPUs :)
I like some of these technologies for taking care of the dragnet data collection and engineering, but when I'm actually doing a data analysis, it's rare that I want to run it on a Spark cluster. I'd much rather run a Spark job to collect the information I need and sample or digest it down to a size I can hack on locally in R or Pandas. Yeah, there will be some sampling error, but the potential dollar cost of that sampling error is much lower than the cost of eliminating it. And it's basically zero compared to the elephant in the room: the sampling bias I'm taking on by using big data in the first place. "Big data," from a stats perspective, is just what people in California like to call "census data."
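To make that workflow concrete, here's a minimal sketch of the "digest it down" step. The Spark side is only indicated in comments (the bucket path and sampling fraction are made up); a local dataframe stands in for the big dataset so the sketch is self-contained:

```python
import pandas as pd

# In practice the first step would be a Spark job, e.g. something like:
#   df = spark.read.parquet("s3://some-bucket/events/").sample(fraction=0.01, seed=42)
#   df.toPandas().to_csv("sample.csv")
# Here we fabricate the "big" data locally so the example runs anywhere.
big = pd.DataFrame({
    "customer_id": range(100_000),
    "spend": [i % 500 for i in range(100_000)],
})

# Digest it down to a size you can iterate on quickly.
local = big.sample(frac=0.01, random_state=42)

# Now hack on it locally; some sampling error is the price of fast iteration.
print(len(local))  # 1,000 rows instead of 100,000
```

The point isn't the exact fraction: it's that a 1% sample is usually plenty for exploratory modeling, and it turns every subsequent analysis step from a cluster job into a sub-second local call.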
Sure, the cloud providers sell computing power, but it’s a race to the bottom on price, and renting makes much more sense than buying hardware for these kinds of bursty analytics workloads.
I don’t think “big data” from a stats perspective is analogous to census data. In some cases, yes, but for applications like recommendation engines you lose a lot of valuable signal by sampling.
I also way prefer to just crank up the RAM on a single instance and use Pandas/Dask instead of dealing with distributed computing headaches :)
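On the "crank up the RAM" approach: even without Dask, plain Pandas can chew through files bigger than you'd want to hold in memory at once by streaming them in chunks. A small sketch (the in-memory CSV is a stand-in for a real file path):

```python
import io
import pandas as pd

# A stand-in for a large CSV on disk; in real use you'd pass a file path.
csv = io.StringIO("user,amount\n" + "\n".join(f"u{i % 10},{i}" for i in range(10_000)))

# Stream the file in chunks and aggregate incrementally, so peak memory
# stays bounded even if the whole file wouldn't fit comfortably in RAM.
totals = {}
for chunk in pd.read_csv(csv, chunksize=1_000):
    for user, amount in chunk.groupby("user")["amount"].sum().items():
        totals[user] = totals.get(user, 0) + amount

print(len(totals))  # one running total per user
```

Dask's dataframe API automates essentially this pattern across cores; the manual version just shows there's no distributed-computing magic required for a lot of "big" aggregations.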
I do think that most companies log lots of data in different places without much thought or structure, and as a result need a "data scientist" who spends most of their time simply aggregating and cleaning data that can then easily be processed on a single machine. I.e., they spend 10% of their time doing statistics.
Maybe I'm wrong, but the above is how all the data scientists that I know describe their work.
Most of a data scientist's time is definitely spent getting data from various places, possibly enriching it, cleaning it, and converting it into a dataframe that can then be modeled. (Surprisingly, a lot of those tasks are beyond pure actuaries / statisticians, who often get nervous writing SQL queries with joins, never mind Spark jobs...)
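For anyone wondering what that enrich-and-clean step actually looks like, here's a toy sketch in Pandas (the tables and columns are invented): a left join to pull in reference data, plus a basic missing-value fix — the dataframe equivalent of the SQL joins mentioned above:

```python
import pandas as pd

# Hypothetical raw event log and a reference table to enrich it with.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "product_id": ["a", "b", "a", "c"],
    "amount": [10.0, None, 12.5, 8.0],   # messy: one missing value
})
products = pd.DataFrame({
    "product_id": ["a", "b", "c"],
    "category": ["toys", "books", "toys"],
})

# Enrich: the dataframe equivalent of a SQL LEFT JOIN.
df = orders.merge(products, on="product_id", how="left")

# Clean: fill the missing amount with the column median.
df["amount"] = df["amount"].fillna(df["amount"].median())
```

Multiply this by a dozen messy sources and you get the "90% of the job" everyone in this thread is describing.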
But once you've got that dataframe, it's often too big to be processed on a single machine in the office, so you need to spin up a VM and move the data around, which requires a lot more computer science / hacking skill than you typically learn in a stats degree. So I think the term "data scientist", for someone who can do the computer-sciency stuff in addition to the stats, has its place...