
Good luck finding anything public about graph processing on a dataset too large to fit on a single machine. I can launch an AWS instance with 128 cores and 4 TB RAM--how many triples is too many for that monster? Tens of billions? Hundreds of billions?
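A quick back-of-envelope sketch, if you assume something like 100 bytes per triple (a guess -- real in-memory stores vary a lot with indexing overhead):

    # Back-of-envelope: how many triples fit in 4 TB of RAM?
    # ASSUMPTION: ~100 bytes per triple including indexes; real stores
    # range from a few tens of bytes to several hundred per triple.
    ram_bytes = 4 * 1024**4          # 4 TiB
    bytes_per_triple = 100           # assumed, not measured
    triples = ram_bytes // bytes_per_triple
    print(f"~{triples / 1e9:.0f} billion triples")   # ~44 billion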



The Common Crawl has 3.28 billion pages. That should yield plenty of edges for 2017. (Shouldn't it?)
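As a rough sketch (the average-outlinks figure below is a guess for illustration, not a Common Crawl statistic):

    # Rough edge-count estimate for a 3.28B-page crawl.
    # ASSUMPTION: average outlinks per page is a guess.
    pages = 3.28e9
    avg_outlinks = 50                # assumed
    edges = pages * avg_outlinks     # ~1.6e11 edges
    bytes_per_edge = 16              # two 64-bit node ids, uncompressed
    print(f"~{edges / 1e9:.0f}B edges, "
          f"~{edges * bytes_per_edge / 1024**4:.1f} TiB as a raw edge list")

So on the order of 150B+ edges, and the raw edge list would still fit in 4 TB of RAM under those assumptions.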


That's true. Thanks for reminding me! Next time I need to torture a graph store I'll try loading Common Crawl into it.


Yes. At $4/hr these instances are truly mind-blowing.

BUT

With only 2 TB of local storage, your 4 TB dataset (or output set) has to pass over the network for every execution. At 10 Gbps, this alone can take 20 minutes to an hour.

If you also want to write it out to the local SSD, multiply that by 3x or 10x.
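The transfer math, as a quick sketch (decimal TB and full line rate assumed, so real-world numbers will be somewhat worse):

    # Wall-clock time to move a 4 TB dataset over the network.
    def transfer_minutes(data_bytes, link_gbps):
        return data_bytes * 8 / (link_gbps * 1e9) / 60

    data = 4e12                                               # 4 TB
    print(f"10 Gbps: {transfer_minutes(data, 10):.0f} min")   # ~53 min
    print(f"25 Gbps: {transfer_minutes(data, 25):.0f} min")   # ~21 min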

That 4 TB of memory is still fronted by only 4 x 30 MB of L3 cache, which means your single-threaded implementation will be slowed down 3x to 5x by memory latency.

Your multi-core implementation will probably suffer even more.
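To put a number on the cache side of it (sketch only, using the 4 x 30 MB figure from above):

    # How much of a 4 TB working set the CPU caches can actually hold.
    # ASSUMPTION: 4 sockets x 30 MB L3, as stated above.
    l3_total = 4 * 30 * 1024**2      # ~126 MB of L3 across sockets
    working_set = 4 * 1024**4        # 4 TiB resident in RAM
    print(f"L3 covers {l3_total / working_set:.4%} of the data")  # ~0.003%

So for graph traversal with poor locality, nearly every pointer chase is a cache miss.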

Distributed systems are VERY hard, but dealing with them is inevitable for certain workloads.


I was referring to the x1e.32xlarge: 2 x 1920 GB ephemeral SSD, 25 Gbps networking plus 14 Gbps of dedicated EBS bandwidth. And at least for the application I'm considering, the working set would fit in RAM and you can run the algorithm online. Score!

That's what I mean. You have to get really big to have Big Data. Fine, you can't do PageRank on a petabyte of web crawls using one machine. But the datasets people actually use for benchmarking, at least the benchmarks that are made public, you definitely can. You can go far larger than a typical benchmark dataset and still do it on one machine.
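For a benchmark-scale graph that fits in RAM, the whole algorithm is a few lines anyway. A minimal sketch (array names made up for illustration, dangling nodes not handled, nothing tuned):

    # Minimal in-memory PageRank for a graph that fits in RAM.
    # ASSUMPTION: edges are already loaded as two integer arrays (src, dst).
    import numpy as np
    import scipy.sparse as sp

    def pagerank(src, dst, n, d=0.85, iters=50):
        # Column-stochastic transition matrix: A[j, i] = 1/outdeg(i) if i -> j.
        # NOTE: dangling nodes (outdeg == 0) are not handled in this sketch.
        outdeg = np.bincount(src, minlength=n).astype(np.float64)
        weights = 1.0 / outdeg[src]
        A = sp.csr_matrix((weights, (dst, src)), shape=(n, n))
        r = np.full(n, 1.0 / n)
        for _ in range(iters):
            r = (1 - d) / n + d * (A @ r)
        return r

    # Tiny usage example: edges 0->1, 1->2, 2->0, 2->1
    src = np.array([0, 1, 2, 2]); dst = np.array([1, 2, 0, 1])
    print(pagerank(src, dst, n=3))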



