
Is 100 Billion (order of a few TB) Big Data?

In my experience, CPU is rarely the big issue when dealing with a lot of data (I am talking about tens of PB per day). IO is the main problem, and designing systems that move the least amount of data is the real challenge.




No it is not, not even close. At a job nearly 10 years ago, we had 50 TB in plain old-fashioned Oracle, and we knew people with 200 TB in theirs (you would be surprised who if I told you). A few TB these days you could crunch quite easily on a high-end desktop or a single midrange server, using entirely conventional techniques.
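A quick back-of-envelope check of that claim (the drive throughputs and the ~3 TB dataset size below are assumptions for illustration, not figures from the thread):

    # How long does one full sequential scan of a few TB take on a single box?
    dataset_bytes = 3 * 10**12          # ~3 TB, roughly "100 billion" small records (assumed)
    drives = {
        "NVMe SSD (~2 GB/s, assumed)": 2 * 10**9,
        "SATA SSD (~500 MB/s, assumed)": 500 * 10**6,
    }
    for label, bytes_per_sec in drives.items():
        minutes = dataset_bytes / bytes_per_sec / 60
        print(f"{label}: one full sequential scan in about {minutes:.0f} minutes")

Even the slow case is on the order of an hour or two for a full pass, which is why conventional tools on one machine are usually enough at this scale.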


(Toby from Localytics here)

Yep, this is a great point. Data locality/reducing IO is huge, but the way things actually play out for us is that when data isn't segmented/partitioned properly, it chews up CPU/memory (toy sketch of this below). This is a lot of why the post was geared around CPU usage: concurrency in Vertica can be a little tricky, and stabilizing compute across the cluster has paid more dividends than any storage or network subsystem tweaks we've made.

We're not at the PB/day mark, though, so there are definitely classes of problems we are blissfully ignorant of. :)
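To make the segmentation point concrete, here is a toy sketch (plain Python with made-up column names, not Vertica syntax): when rows are segmented by the column a query filters on, the query touches one node's slice; when they are segmented by something unrelated, every node has to burn CPU scanning its whole slice.

    from collections import defaultdict

    NODES = 4

    def segment_by_hash(rows, key):
        """Assign each row to a node by hashing its segmentation key."""
        nodes = defaultdict(list)
        for row in rows:
            nodes[hash(row[key]) % NODES].append(row)
        return nodes

    rows = [{"app_id": i % 10, "event": "open", "seq": i} for i in range(1000)]

    # Segmented on app_id: a query filtered on app_id=3 scans one node's slice.
    good = segment_by_hash(rows, "app_id")
    scanned_good = len(good[hash(3) % NODES])

    # Segmented on an unrelated column: rows for app_id=3 are spread everywhere,
    # so the same filter has to scan every node's slice.
    bad = segment_by_hash(rows, "seq")
    scanned_bad = sum(len(chunk) for chunk in bad.values())

    print(f"well segmented:   ~{scanned_good} rows scanned on one node")
    print(f"poorly segmented: {scanned_bad} rows scanned across {NODES} nodes")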


The major takeaway I had from my courses on data-intensive applications was that IO is all that matters. It is the limiting factor to such an extent that you don't really care about the algorithmic efficiency with regard to CPU calculations or memory.

You analyse algorithms in terms of IO accesses, and specifically the access pattern. If you cannot structure the algorithm as a sequential scan, you're in for a bad time.
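A minimal sketch of what working in a scanning fashion means (the file name and record layout are hypothetical): one forward pass over the data, state proportional to the number of distinct keys rather than the number of rows, and no backwards seeks, so the cost is a single sequential read no matter how large the file is.

    def total_bytes_per_user(path):
        """One sequential pass: never seek backwards, keep only per-user counters."""
        totals = {}
        with open(path) as f:
            for line in f:                     # strictly forward access pattern
                user, nbytes = line.rstrip("\n").split(",")
                totals[user] = totals.get(user, 0) + int(nbytes)
        return totals

    # Usage (hypothetical file of "user,bytes" lines):
    # print(total_bytes_per_user("events.csv"))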


There is always a balance here between CPU and IO. For a long time, databases and big data platforms were pretty terrible with IO. However, as the computer engineering community has had time to work on these problems, we have gotten considerably better at understanding how to store data via sorted and compressed columnar formats and how to exploit data locality via segmentation and partitioning. As such, most well-constructed big data products are CPU bound at this point. For instance, check out the NSDI '15 paper on Spark performance that found it was CPU bound. Vertica is also generally CPU bound.

https://www.usenix.org/conference/nsdi15/technical-sessions/...
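To illustrate the sorted/compressed columnar idea (pyarrow/Parquet here is just a stand-in for illustration; the paper is about Spark and the thread is about Vertica): a query that needs one column reads and decompresses only that column, so IO drops sharply while the decode/decompress work becomes exactly the kind of CPU cost the paper measures.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write a small columnar file, sorted on ts so compression and range pruning work well.
    table = pa.table({
        "ts":    list(range(1_000_000)),
        "user":  [i % 10_000 for i in range(1_000_000)],
        "bytes": [(i * 37) % 1500 for i in range(1_000_000)],
    })
    pq.write_table(table, "events.parquet", compression="snappy")

    # Read back only the column the query needs: little IO, but the CPU still
    # pays for decompression and decoding, which is where "CPU bound" comes from.
    bytes_col = pq.read_table("events.parquet", columns=["bytes"])["bytes"]
    print(bytes_col.length())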


After skimming the paper, I'm fairly confident it's not the same at all. We only covered the theoretical side of a scenario with multi-TB hard drives across multiple machines. Any efficient algorithm would work in a scanning manner and not seek backwards beyond what could be kept in RAM. We did simulate this, and the result was quite clear: IO matters.

From the paper, the following three quotes highlight exactly why they were CPU bound:

> We found that if we instead ran queries on uncompressed data, most queries became I/O bound

> is an artifact of the decision to write Spark in Scala, which is based on Java: after being read from disk, data must be deserialized from a byte buffer to a Java object

> for some queries, as much as half of the CPU time is spent deserializing and decompressing data


Specifically network traffic, rather than disk read/write.



