Note to mods: The title is correctly written “COST”; it’s short for “Configuration that Outperforms a Single Thread”, the central concept being discussed.
That’s one of the things that stuck with me after having taken a class in high performance computing: what matters is absolute performance, not so much the speedup one achieves through parallelisation.
Still, it wasn't that helpful from a business perspective, because apparently it's easier to justify a markup on solutions with lots of servers.
In the area we were in, customers could tolerate some outage. Restoring from a backup didn't take that long, and I believe we were known for good support, helping people get back up.
But yeah, what you're saying is all the arguments that kept coming at us. People were used to having a few servers for the database, different front ends, and layering things like that.
Cluster computing can be useful, but until you're talking about petabytes of data, it probably isn't helping you.
"Research demonstrated that there is no benefit to using a graph-centric execution engine and storage manager" - which I take to be the kind of systems that the paper is critical of.
Though you'd suppose graph execution engines should outperform a single-threaded "think like a vertex" implementation.
Which links to this paper: "The case against specialized graph analytics engines" (http://cidrdb.org/cidr2015/Papers/CIDR15_Paper20.pdf)
They use a relational database to outperform dedicated graph databases.
Is part of the takeaway that hand-coded, single-threaded "think like a vertex" code beats a distributed system, with all the communication its parallelisation involves?
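To make that concrete, here's a minimal sketch of the style in question (my own illustration, not the paper's actual code): connected components via label propagation over an in-memory edge list, single-threaded, no framework, no communication.

    fn label_propagation(num_nodes: usize, edges: &[(usize, usize)]) -> Vec<usize> {
        // Every vertex starts in its own component.
        let mut label: Vec<usize> = (0..num_nodes).collect();
        let mut changed = true;
        while changed {
            changed = false;
            // One sequential pass over the edge list per round; each
            // endpoint adopts the smaller label ("think like a vertex").
            for &(u, v) in edges {
                let m = label[u].min(label[v]);
                if label[u] != m { label[u] = m; changed = true; }
                if label[v] != m { label[v] = m; changed = true; }
            }
        }
        label
    }

    fn main() {
        let edges = [(0, 1), (1, 2), (3, 4)];
        println!("{:?}", label_propagation(5, &edges)); // [0, 0, 0, 3, 3]
    }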
The COST paper is talking about the situation where you have 10 units of work, but it takes 10 more units of work to distribute it across multiple computers.
Sometimes that's a win and sometimes it isn't, thus the phrase "parallelizing your overhead". If all you did was parallelize the additional overhead, then you didn't gain anything by using a distributed system.
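A toy version of that arithmetic (numbers mine, just to illustrate): with 10 units of work and a fixed 10 units of distribution overhead, the cluster never catches the single thread.

    // Toy COST arithmetic (my numbers, not the paper's): total work w,
    // fixed distribution overhead o, n machines splitting the compute.
    fn wall_time(w: f64, o: f64, n: f64) -> f64 {
        o + w / n // the overhead is paid regardless of n
    }

    fn main() {
        let (w, o) = (10.0, 10.0);
        for n in [1.0, 2.0, 4.0, 8.0] {
            println!("n={}  distributed={:.2}  single thread={:.2}",
                     n, wall_time(w, o, n), w);
        }
        // 10 + 10/n never drops below 10: every configuration only
        // parallelizes its own overhead, so the COST is unbounded.
    }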
IOW, the overhead of distribution can be very large. It's still there in parallel computing, but shared memory makes it less pronounced. (Although NUMA makes your single computer look like a distributed system again.)
That's just serialized work in a different form: for a computer to do a unit of work, it must first be sent that unit of work, and a computer can't process work it doesn't have. So those 10 units of distribution work are 10 serial operations, no matter how many machines are waiting on the other end.
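A small Rust sketch of that bottleneck (the worker count and the squaring stand-in for real work are mine): the workers run in parallel, but the coordinator hands out units in a strictly serial loop.

    use std::sync::mpsc;
    use std::thread;

    fn main() {
        let mut workers = Vec::new();
        let mut senders = Vec::new();
        for _ in 0..10 {
            let (tx, rx) = mpsc::channel::<u64>();
            senders.push(tx);
            // Workers run in parallel once they have work...
            workers.push(thread::spawn(move || {
                let unit = rx.recv().unwrap();
                unit * unit // squaring as a stand-in for real work
            }));
        }
        // ...but the coordinator hands the units out one at a time:
        // ten sends, strictly serial, however many workers are waiting.
        for (i, tx) in senders.iter().enumerate() {
            tx.send(i as u64).unwrap();
        }
        let results: Vec<u64> = workers.into_iter().map(|h| h.join().unwrap()).collect();
        println!("{:?}", results);
    }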
Amdahl wasn't only talking about separate chunks, with the parallelized work in one part of the code and the serial work in another. The serial and parallel operations can be interleaved.
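For reference, Amdahl's law with serial fraction s gives speedup(n) = 1 / (s + (1 - s)/n), and the 1/s cap holds whether the serial work is one block or interleaved throughout. A quick evaluation (the 5% figure is mine):

    // Amdahl's law: a serial fraction s caps speedup at 1/s no matter
    // how many processors n you throw at it. s = 5% is illustrative.
    fn amdahl(s: f64, n: f64) -> f64 {
        1.0 / (s + (1.0 - s) / n)
    }

    fn main() {
        let s = 0.05;
        for n in [1.0, 8.0, 64.0, 1024.0] {
            println!("n={:>6}  speedup={:.2}", n, amdahl(s, n));
        }
        // Tops out near 1/s = 20x -- and the cap is the same whether the
        // serial 5% sits in one block or is interleaved through the code.
    }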
https://en.m.wikipedia.org/wiki/Bulk_synchronous_parallel is a useful model for incorporating communication costs into algorithm design, compared to PRAM.
One might even say that the cost of implementing your algorithm under BSP is higher than under PRAM, because of the extra layers. But you can't ignore some of the things that a strict PRAM model lets you ignore, so you have to incorporate those into the cost as well.
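The standard BSP cost model makes those extra terms explicit: a superstep costs w + g*h + l, where w is the maximum local work on any processor, h the maximum messages any processor sends or receives, g the per-message network cost, and l the barrier latency. A sketch with made-up g and l:

    // BSP superstep cost: w + g*h + l, with w = max local work on any
    // processor, h = max messages sent/received by any processor,
    // g = per-message network cost, l = barrier latency.
    // The g and l values below are made up for illustration.
    fn superstep_cost(w: f64, h: f64, g: f64, l: f64) -> f64 {
        w + g * h + l
    }

    fn main() {
        let (g, l) = (4.0, 100.0);
        println!("PRAM charges only the work: {}", 1000.0);
        println!("BSP also charges comms + barrier: {}",
                 superstep_cost(1000.0, 500.0, g, l));
    }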
Given Gustafson's law, if you have enough data, enough work to do, you can sort of ignore the un-parallelizable fraction by throwing enough processors at the computation.
What starts to trip things up at this scale is synchronization.
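Concretely, Gustafson's scaled speedup for serial fraction s is S(n) = n - s*(n - 1): grow the problem with the machine and efficiency stays high, right up until those synchronization costs start growing with n. Toy numbers (mine):

    // Gustafson's law: scale the problem with the machine and the scaled
    // speedup is S(n) = n - s*(n - 1), near-linear for small serial
    // fraction s. The s value is illustrative.
    fn gustafson(s: f64, n: f64) -> f64 {
        n - s * (n - 1.0)
    }

    fn main() {
        let s = 0.05;
        for n in [8.0, 64.0, 1024.0] {
            println!("n={:>5}  scaled speedup={:.1}", n, gustafson(s, n));
        }
        // Near-linear, unlike Amdahl's fixed-size 20x cap -- but only
        // until synchronization costs start growing with n and bite back.
    }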
But the reason I laughed is that almost every bug report starts with a minor war about whose service is actually responsible, and then: how was this contract defined again? Oh, so the only specification is this example file you sent us two years ago? Aha, so the public order number isn’t what’s called order_number, we should obviously have used CustONr. Etc.
In theoretical CS classes there were discussions of the tradeoffs between networked, NUMA, and other fabrics. Analysis of what actually ran fastest, beyond Big O notation, was only touched on briefly, but there is a definite advantage to making problems tractable that otherwise wouldn't be. In the FAANGs it was mostly embarrassingly parallel algorithms with a skin of distributed computing for synchronization/coordination, and so the focus had always been on absolute speed or efficiency.
From the paper it's not clear how much of the overhead is fundamental and how much is down to the frameworks just being terrible (many of them are written in Java, and no performance-minded programmer would have picked Java for efficient algorithmic code over C++, or over Rust once it was available).