Hacker News

The big difference is that historically you could run pretty much all analysis on a single computer (if not in Excel - I’ve seen and built many actuarial models in Excel myself). Consumer-internet-scale data, both in terms of the number of rows and the number of available features, is too big to model without distributed computing and more programming work on the part of the analyst.

Moreover, it’s an iterative process: you start with hypotheses about which features will be useful, build a model, and then might need to pull more data. For example, you could build a recommendation engine using just the set of products purchased by customers, but then explore adding user-agent features like device type, or add data about the products like category or price information.

An analyst (“data scientist”) who can pull and interact with that data is much more effective than one who needs to wait for someone else to pull the data for them (most actuaries I know are weak programmers outside of R and wouldn’t know where to begin using distributed computing or getting a TensorFlow grid search running on a cluster of cloud GPUs :)

I actually think that analytics clusters are one of those wild-west things. The statistician in me is suspicious of a lot of the Big Data movement, for various reasons that could largely be summarized as, "I'm pretty sure these ways of working are motivated less by any true analytics need and more by the needs of cloud services providers to sell cloud services."

I like some of these technologies for taking care of the dragnet data collection and engineering, but, when I'm actually doing a data analysis, it's rare I want to run it on a Spark cluster; I'd much rather run a Spark job to collect the information I need and sample or digest it down to a size that I can hack on locally in R or Pandas. Yeah, there will be some sampling error, but the potential dollar cost associated with that sampling error is much lower than the cost to eliminate that sampling error. And it's basically zero compared to the elephant in the room: the sampling bias I'm taking on by using big data in the first place. "Big data" is, from a stats perspective, just the word for "census data" that people like to use in California.
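The digest-it-down-to-local-size workflow can be sketched in plain Python. Here reservoir sampling over a generic row stream stands in for the Spark job; the `user_id`/`spend` schema and the function name are hypothetical, just for illustration:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown
    length, seeing each item exactly once (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace an existing pick with probability k / (i + 1)
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

# Digest a "big" stream down to something you can hack on in Pandas or R.
rows = ({"user_id": i, "spend": i % 97} for i in range(1_000_000))
local_subset = reservoir_sample(rows, k=10_000)
```

In a real pipeline the stream would be the output of the Spark collection job, and `local_subset` is what gets pulled onto the laptop.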

I totally get your skepticism but the bottom line is that throwing computing power at a problem generally leads to a much better solution, even if only because you can do a much more extensive grid search. This doesn’t have to involve some complex cluster: you could just have n copies of the same VM, each running experiments with parameters pulled from a Google sheet until all the experiments are done.
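The n-identical-VMs idea can be sketched with static index sharding in place of the shared Google sheet (the function names and the parameter space are hypothetical; each VM would call `shard` with its own worker index):

```python
from itertools import product

def param_grid(space):
    """Expand a dict of lists into the full list of parameter combinations."""
    keys = sorted(space)
    return [dict(zip(keys, combo)) for combo in product(*(space[k] for k in keys))]

def shard(grid, worker_id, n_workers):
    """The slice of the grid that worker `worker_id` of `n_workers` should run.
    Striding by n_workers partitions the grid with no coordination needed."""
    return grid[worker_id::n_workers]

space = {"learning_rate": [0.01, 0.1], "depth": [3, 5, 7]}
grid = param_grid(space)                      # 6 combinations
work = [shard(grid, i, 4) for i in range(4)]  # disjoint slices for 4 VMs
```

A shared sheet adds dynamic load balancing on top of this, but the partitioning logic is the same: every combination runs exactly once.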

Sure, the cloud providers sell computing power, but pricing is a race to the bottom, and renting makes much more sense than buying hardware for these kinds of bursty analytics workloads.

I don’t think “Big data” from a stats perspective is analogous to census data - in some cases yes but for applications like recommendation engines you lose a lot of valuable signal by sampling.

Given that GCP will happily supply VMs that can be resized to dozens of CPUs and hundreds of GB of RAM and back, billed by the minute, sampling bias isn't even a necessity for quite a lot of things a laptop wouldn't be able to handle. I used to write Spark jobs for those slightly-too-big data problems, but Pandas and Dask are quite sufficient for lots of things, without all the headache that distributed computing entails. Plus, data people have no need to store any potentially sensitive data on their personal machines - one less headache. It's not going to work well for petabyte-scale stuff, though. I guess for those kinda things and for periodic bespoke ETL/ELT jobs, Spark is still useful.
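For the slightly-too-big-for-memory case, plain Pandas already does streaming aggregation via `chunksize`, no Dask or Spark needed. A minimal sketch, using an in-memory `StringIO` with a made-up `user_id,spend` schema as a stand-in for a file too large to load at once:

```python
import io
import pandas as pd

# Stand-in for a multi-GB CSV (hypothetical schema, synthetic values).
csv = io.StringIO(
    "\n".join(["user_id,spend"] + [f"{i},{i % 10}" for i in range(100_000)])
)

# Stream the file in chunks, keeping only running aggregates in memory.
total = 0.0
count = 0
for chunk in pd.read_csv(csv, chunksize=10_000):
    total += chunk["spend"].sum()
    count += len(chunk)

mean_spend = total / count
```

Peak memory is one chunk, not the whole file, which is often all the "big data" handling a single fat VM needs.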

The ability to re-size GCP VMs totally blew me away when I first discovered it. Just power off the machine, drag the RAM slider way up and turn it back on. SOOO much easier than re-creating the VM on AWS.
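The same resize flow is available from the `gcloud` CLI; a sketch assuming a hypothetical instance name and zone (the instance must be stopped before its machine type can change):

```shell
# Hypothetical instance name and zone.
gcloud compute instances stop my-analytics-vm --zone=us-central1-a
gcloud compute instances set-machine-type my-analytics-vm \
    --zone=us-central1-a --machine-type=n1-highmem-64
gcloud compute instances start my-analytics-vm --zone=us-central1-a
```

Scripting this makes it easy to scale the box up for a heavy job and back down afterwards, since billing follows the machine type by the minute.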

I also way prefer to just crank up the RAM on a single instance and use Pandas/Dask instead of dealing with distributed computing headaches :)

I'm pretty suspicious of the idea that most companies have data that is too big for a non-distributed environment.

I do think that most companies log lots of data in different places without much thought or structure, and as a result need a "data scientist", who spends most of their time simply aggregating and cleaning data that can then easily be processed on a single machine - i.e., they spend 10% of their time doing statistics.

Maybe I'm wrong, but the above is how all the data scientists that I know describe their work.

In my experience it's a bit of both.

Most of a data scientist's time is definitely spent getting data from various places, possibly enriching it, cleaning it, and converting it into a dataframe that can then be modeled (and surprisingly, a lot of those tasks are often beyond pure actuaries / statisticians, who often get nervous writing SQL queries with joins, never mind Spark jobs)...

But also, once you've got that dataframe, it's often too big to be processed on a single machine in the office, so you need to spin up a VM and move the data around. That requires a lot more computer science / hacking skills than you typically learn in a stats university degree, so I think the term "data scientist" for someone who can do the computer-sciency stuff in addition to the stats has its place...
