
I actually think that analytics clusters are one of those wild west things. The statistician in me is suspicious of a lot of the Big Data movement, for reasons that could largely be summarized as: "I'm pretty sure these ways of working are motivated less by any true analytics need and more by cloud providers' need to sell cloud services."

I like some of these technologies for handling the dragnet data collection and engineering, but when I'm actually doing a data analysis, it's rare that I want to run it on a Spark cluster; I'd much rather run a Spark job that collects the information I need and samples or digests it down to a size I can hack on locally in R or Pandas. Yes, there will be some sampling error, but the dollar cost of that error is much lower than the cost of eliminating it, and it's basically zero next to the elephant in the room: the sampling bias I take on by using big data in the first place. "Big data" is, from a stats perspective, just the word for "census data" that people like to use in California.
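
Concretely, the pattern is something like the following (a minimal PySpark sketch, not anything from this thread: the bucket path, column names, and the 1% fraction are all invented placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("digest-for-local-analysis").getOrCreate()

    # Scan the full dataset on the cluster, keep only the columns the analysis
    # needs, and sample it down to laptop size. Path and columns are placeholders.
    events = spark.read.parquet("s3://warehouse/events/")
    digest = (
        events
        .select("user_id", "event_type", "amount", "ts")
        .where(events.event_type == "purchase")
        .sample(fraction=0.01, seed=42)  # ~1% simple random sample
    )

    # Small enough now to pull onto one machine and hack on in Pandas or R.
    digest.toPandas().to_parquet("purchases_sample.parquet")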

I totally get your skepticism, but the bottom line is that throwing computing power at a problem generally leads to a much better solution, even if only because you can run a much more extensive grid search. This doesn't have to involve some complex cluster: you could just have n copies of the same VM, each running experiments with parameters pulled from a Google Sheet until all the experiments are done.
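
For what it's worth, the worker loop on each VM can be dead simple. A sketch, assuming gspread for Sheets access; the sheet name, its column layout, and run_experiment are all hypothetical:

    import gspread  # assumes service-account credentials are already configured

    def run_experiment(lr, depth):
        # Placeholder for the real training/evaluation code.
        return 0.0

    gc = gspread.service_account()
    sheet = gc.open("experiments").sheet1  # hypothetical sheet name
    STATUS_COL = 3  # assumed position of a "status" column

    # get_all_records() skips the header row, so data rows start at 2.
    for i, row in enumerate(sheet.get_all_records(), start=2):
        if row["status"]:
            continue  # another VM already claimed this parameter set
        sheet.update_cell(i, STATUS_COL, "running")  # not atomic, but fine for a few VMs
        score = run_experiment(lr=row["learning_rate"], depth=row["depth"])
        sheet.update_cell(i, STATUS_COL, "done: %s" % score)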

Sure, the cloud providers sell computing power, but it's a race to the bottom, and renting makes much more sense than buying hardware for these kinds of bursty analytics workloads.

I don't think "big data" is, from a stats perspective, analogous to census data. In some cases, yes, but for applications like recommendation engines you lose a lot of valuable signal by sampling.

Given that GCP will happily supply VMs that can be resized to dozens of CPUs and hundreds of GB of RAM and back, billed by the minute, sampling (and the bias it can introduce) isn't even necessary for quite a lot of things a laptop couldn't handle. I used to write Spark jobs for those slightly-too-big data problems, but Pandas and Dask are quite sufficient for a lot of them, without all the headache that distributed computing entails. Plus, data people never need to store potentially sensitive data on their personal machines, which is one less headache. It's not going to work well for petabyte-scale stuff, though; for those kinds of things, and for periodic bespoke ETL/ELT jobs, I guess Spark is still useful.
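
For the slightly-too-big case, the single-machine Dask version really is about this short (a sketch; the path and column names are invented, and reading from GCS assumes gcsfs is installed):

    import dask.dataframe as dd

    # Larger-than-RAM parquet is fine on one big VM, since Dask processes
    # the data partition by partition instead of loading it all at once.
    df = dd.read_parquet("gs://bucket/events/*.parquet")  # placeholder path

    daily = (
        df[df["status"] == "ok"]
        .groupby("date")["amount"]
        .sum()
        .compute()  # nothing runs until here; the result is a small pandas Series
    )
    print(daily.head())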

The ability to resize GCP VMs totally blew me away when I first discovered it. Just power off the machine, drag the RAM slider way up, and turn it back on. SOOO much easier than re-creating the VM on AWS.

I also way prefer to just crank up the RAM on a single instance and use Pandas/Dask instead of dealing with distributed computing headaches :)
