I like some of these technologies for taking care of the dragnet data collection and engineering, but when I'm actually doing a data analysis, it's rare that I want to run it on a Spark cluster; I'd much rather run a Spark job to collect the information I need and sample or digest it down to a size I can hack on locally in R or Pandas. Yes, sampling introduces some error, but the dollar cost of that error is much lower than the cost of eliminating it. And it's basically zero next to the elephant in the room: the sampling bias I take on by using big data in the first place, since the data is a census of whoever happened to land in my logs, not of the population I actually care about. "Big data" is, from a stats perspective, just the word for "census data" that people like to use in California.
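A minimal sketch of that digest-then-sample workflow in PySpark; the table, path, column names, and the 1% fraction are all made-up placeholders:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("digest-for-local-analysis").getOrCreate()

    # Hypothetical event log; the point is to aggregate and sample on the
    # cluster, then pull only a laptop-sized result down for local hacking.
    events = spark.read.parquet("s3://bucket/events/")

    digest = (
        events
        .groupBy("user_id")
        .agg(
            F.count("*").alias("n_events"),
            F.sum("revenue").alias("total_revenue"),
        )
        .sample(fraction=0.01, seed=42)  # accept some sampling error here
    )

    # toPandas() materializes only the small sampled digest on the driver
    digest.toPandas().to_parquet("digest_sample.parquet")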
Sure, the cloud providers sell computing power, but pricing there is a race to the bottom, and renting makes much more sense than buying hardware for these kinds of bursty analytics workloads.
I don't think "big data" is analogous to census data from a stats perspective. In some cases, yes, but for applications like recommendation engines you lose a lot of valuable signal by sampling.
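A toy illustration of that point, with synthetic Zipf-distributed interactions (all numbers invented): a uniform 1% sample keeps the popular head of the catalog but never sees most of the long tail a recommender has to learn about.

    import numpy as np

    rng = np.random.default_rng(0)

    # Item popularity in recommendation logs is typically Zipf-like:
    # a short head of popular items and a very long tail.
    n_items, n_interactions = 50_000, 5_000_000
    items = rng.zipf(a=1.3, size=n_interactions) % n_items

    sample = rng.choice(items, size=n_interactions // 100, replace=False)

    print("distinct items in full data: ", np.unique(items).size)
    print("distinct items in 1% sample:", np.unique(sample).size)
    # The sample covers far fewer distinct items, so the long-tail
    # signal the full data carries is simply gone.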
I also way prefer to just crank up the RAM on a single instance and use Pandas/Dask instead of dealing with distributed computing headaches :)
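For the single-big-box route, a sketch of the kind of thing Dask makes painless on one machine; the path and column names are hypothetical:

    import dask.dataframe as dd

    # Pandas-style semantics, multi-core, no cluster to manage; Dask works
    # through the data partition by partition, so it need not fit in RAM.
    df = dd.read_parquet("data/events/*.parquet")

    top_users = (
        df.groupby("user_id")["revenue"]
        .sum()
        .nlargest(100)
        .compute()  # only this small result is materialized in memory
    )
    print(top_users)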