
Given that GCP will happily supply VMs that can be resized to dozens of CPUs and hundreds of GB of RAM and back, billed by the minute, you don't even need to fall back on sampling for quite a lot of things a laptop wouldn't be able to handle. I used to write Spark jobs for those slightly-too-big data problems, but Pandas and Dask are quite sufficient for lots of them, without all the headaches that distributed computing entails. Plus, data people have no need to store potentially sensitive data on their personal machines, which is one less headache. It's not going to work well for petabyte-scale stuff, though; for that kind of thing, and for periodic bespoke ETL/ELT jobs, Spark is still useful.
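
Roughly the kind of thing I mean, as a minimal sketch (the Parquet path and column names are made up; assumes dask[dataframe] and pyarrow are installed):

    # Dask on one big VM: same dataframe API as Pandas, evaluated lazily
    # partition by partition, so the whole dataset never has to sit in RAM at once.
    import dask.dataframe as dd

    # Hypothetical dataset: a directory of Parquet files bigger than memory.
    df = dd.read_parquet("data/events/")

    # The groupby-aggregate runs on Dask's default local scheduler,
    # using all the cores the VM has; no cluster to set up or babysit.
    totals = df.groupby("customer_id")["amount"].sum().compute()

    print(totals.nlargest(10))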





The ability to resize GCP VMs totally blew me away when I first discovered it. Just power off the machine, drag the RAM slider way up, and turn it back on. SOOO much easier than re-creating the VM on AWS.
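
The CLI version is just a stop / set-machine-type / start cycle. A rough sketch wrapped in Python (instance name, zone, and machine type are placeholders; the VM has to be stopped before its type can change):

    # Resize a GCE VM in place: stop it, switch the machine type, start it again.
    # Uses the gcloud CLI via subprocess; all names below are placeholders.
    import subprocess

    INSTANCE = "analytics-box"
    ZONE = "europe-west1-b"
    MACHINE_TYPE = "n2-highmem-32"   # 32 vCPUs, 256 GB RAM

    def gcloud_instances(*args: str) -> None:
        subprocess.run(
            ["gcloud", "compute", "instances", *args, INSTANCE, f"--zone={ZONE}"],
            check=True,
        )

    gcloud_instances("stop")                                            # power off
    gcloud_instances("set-machine-type", f"--machine-type={MACHINE_TYPE}")
    gcloud_instances("start")                                           # boot back up with more RAM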

I also way prefer to just crank up the RAM on a single instance and use Pandas/Dask instead of dealing with distributed computing headaches :)



