
The Cost of Scalability in Graph Processing - ms705
http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
======
scott_s
The GraphChi paper from OSDI 2012 made a similar observation:
[http://select.cs.cmu.edu/publications/paperdir/osdi2012-kyrola-blelloch-guestrin.pdf](http://select.cs.cmu.edu/publications/paperdir/osdi2012-kyrola-blelloch-guestrin.pdf)

From the abstract: "In this work, we present GraphChi, a disk-based system for
computing efficiently on graphs with billions of edges. By using a well-known
method to break large graphs into small parts, and a novel parallel sliding
windows method, GraphChi is able to execute several advanced data mining,
graph mining, and machine learning algorithms on very large graphs, using just
a single consumer-level computer. ... By repeating experiments reported for
existing distributed systems, we show that, with only a fraction of the
resources, GraphChi can solve the same problems in very reasonable time."

Section 7.2 compares GraphChi to distributed graph frameworks.

~~~
chimtim
There is another paper, X-Stream from SOSP 2013, which ran a Facebook-sized
graph on a single machine. I wonder what is "hot" in this HotOS submission?

~~~
frankmcsherry
The thrust of the post wasn’t meant to be that it is surprising that you can
even do these things on one core; sorry if it came across that way.

Rather, it’s that when doing the work on one core (with a way simple
implementation) you go faster than many of the measurements that popular
scalable systems put forward as evidence that they improve on the state of the
art. Which sort of invalidates that evidence. Some of the papers end up left
with fairly scant evidence, which should be a serious issue for researchers.

GraphChi and X-Stream
([http://infoscience.epfl.ch/record/188535](http://infoscience.epfl.ch/record/188535))
are much more sophisticated implementations, ones you might not be embarrassed
to be beaten by, so they don’t make the same point. And they aren’t any faster
than the systems above, and so have the same issues (though perhaps less
explaining to do).

~~~
PaulHoule
I had to propagate colors across a graph with a billion or so edges and had an
easy time doing it in-memory in Java with a 32GB laptop and sub-byte data
structures. In fact, it takes more time to serialize and deserialize the data
than it takes to do the actual calculation.
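
For flavor, here's a toy sketch of that kind of in-memory color propagation.
It's not his actual code (and it's Perl rather than Java), but it shows what a
sub-byte structure buys you: vec() packs one 4-bit color per node, so a
billion nodes fit in roughly 500MB.

    use strict;
    use warnings;

    # Toy label propagation: one 4-bit color per node via vec().
    # Node count and edge list are made up.
    my $n = 8;
    my @edges = ([0,1], [1,2], [2,3], [4,5], [5,6]);

    my $colors = "\0" x int(($n + 1) / 2);           # packed 4-bit fields
    vec($colors, $_, 4) = $_ % 15 + 1 for 0 .. $n - 1;

    my $changed = 1;
    while ($changed) {                               # iterate to a fixpoint
        $changed = 0;
        for my $e (@edges) {
            my ($u, $v) = @$e;
            my ($cu, $cv) = (vec($colors, $u, 4), vec($colors, $v, 4));
            # propagate the smaller color across the edge
            if    ($cu < $cv) { vec($colors, $v, 4) = $cu; $changed = 1; }
            elsif ($cv < $cu) { vec($colors, $u, 4) = $cv; $changed = 1; }
        }
    }
    printf "node %d -> color %d\n", $_, vec($colors, $_, 4) for 0 .. $n - 1;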

I'd say, however, that the trend is toward much bigger data sets, and I
frequently break non-scalable tools in the data-profiling process. You can
extend the non-scalable ways of doing things by using multiple cores (which is
sometimes easy), SIMD instructions, or the GPU. I have even sometimes gone
down the rabbit hole of optimizing something non-scalable and hit the wall. So
I am using scalable systems increasingly.

------
zackmorris
I'm concerned that articles like this paint multiprocessing in a bad light.
Yes, there are issues like
[http://en.wikipedia.org/wiki/Amdahl's_law](http://en.wikipedia.org/wiki/Amdahl's_law)
but many real-world tasks are "embarrassingly parallel" and it takes little
effort to break them into segments that can be processed concurrently.

After skimming the article, I’m thinking the real bottleneck here is latency
since it mentions Hilbert curves. Currently networks are orders of magnitude
slower than memory but that won’t always be the case. A big game changer is
going to be content-addressable memory because we won’t have to worry about
network topology as much. It will work more like BitTorrent and locally cache
frequently-used data as needed.

Going forward, I have to admit that I’m not hugely fond of big data schemes as
they’re currently conceived. There is way too much emphasis on using strange
new databases and commodity hardware. I want just the opposite approach - low
level access to data with a language like Go or Rust and new hardware with
hundreds or thousands of cores on the same chip so we can get revolutionary
performance (like with Bitcoin ASICs). Then if we want to double performance,
we simply double the number of cores rather than hand-optimizing code, and
that is going to be huge for productivity.

~~~
revelation
Of course networks are always going to be orders of magnitude slower. That's
because we're already at the very point where sheer _distance_ becomes a
limiting factor to performance.

Remember Grace Hopper:

[https://www.youtube.com/watch?v=JEpsKnWZrJ8](https://www.youtube.com/watch?v=JEpsKnWZrJ8)

------
bane
Lots of people make the mistake of thinking there are only two directions you
can go to improve performance: high or wide.

High - throw hardware at the problem, on a single machine

Wide - add more machines

There's a third direction you can go, I call it "going deep". Today's programs
run on software stacks so high and so abstract that we're just now getting
around to redeveloping (again for like the 3rd or 4th time) software that
performs about as well as software we had around in the _1990s and early
2000s_.

Going deep means stripping away this nonsense and getting down closer to the
metal: using smart algorithms, planning and working through a problem, and
seeing if you can size the solution to run on one machine as-is. Modern
CPUs, memory and disk (especially SSDs) are unbelievably fast compared to what
we had at the turn of the millennium, yet we treat them like they're spare
capacity to soak up even lazier abstractions. We keep thinking that completing
the task means successfully scaling out a complex network of compute nodes,
but completing the task _actually_ means processing the data and getting
meaningful results in a reasonable amount of time.

This isn't really hard to do (but it can be tedious), and it doesn't mean
writing system-level C or ASM code. Just see what you can do on a single
medium-specc'd consumer machine _first_, then scale up or out if you
_really_ need to. It turns out a great many problems really don't need
scalable compute clusters. In fact, for the time you'd spend setting one up
and building the coordinating code (which introduces yet more layers that soak
up performance), you'd probably be better off just doing the work on a single
machine.

Bonus: if your problem gets too big for a single machine (it happens), there
might be trivial parallelism in the problem you can exploit, and now going
wide means you'll probably outperform your original design anyway, and the
coordination code is likely to be much simpler and less performance-degrading.
Or you can go high and throw a bigger machine at it, getting more gains with
zero planning or effort beyond copying your code and the data to the new
machine and plugging it in.

Oh yeah, many of us, especially experienced people or those with lots of
school time, are taught to overgeneralize our approaches. It turns out many
big compute problems are just big one-off problems and don't need a
generalized approach. Survey your data, plan around it, and then write your
solution as a specialized approach just for the problem you have. It'll likely
run much faster this way.

Some anecdotes:

\- I wrote an NLP tool that, on a single spare desktop with no exotic
hardware, was 30x faster than a distributed system of 6 high-end compute nodes
doing a comparable task. That group eventually adopted my solution with a
go-high approach and runs it on a big multi-core system with the fastest
memory and SSDs they could procure, and it's about 5 times faster than my
original code. My code was in Perl; the distributed system it competed against
was C++. The difference was the algorithm I was using, and not
overgeneralizing the problem. Because my code could complete their task in 12
hours instead of 2 weeks, it meant they could iterate every day. A 14:1
iteration opportunity made a huge difference in their workflow and within
weeks they were further ahead than they had been after 2 years of sustained
work. Later they ported my code to C++ and realized even further gains.
They've never had to even think about distributed systems. As hardware gets
faster, they simply copy the code and data over and realize the gains and it
performs faster than they can analyze the results.

Every vendor that's come in after that has been forced to demonstrate that
their distributed solution is faster than the one they already have running in
house. Nobody's been able to demonstrate a faster system to date. It has saved
them literally tens of millions of dollars in hardware, facility and staffing
costs over the last half-decade.

\- Another group had a large graph they needed to conduct a specific kind of
analysis on. They had a massive distributed system handling the graph, which
was about 4 petabytes in size. The analysis they wanted to do was O(n^2): each
node potentially needed to be compared against every other node. So they
naively set up some code to do the task, with all kinds of exotic data stores
and specialized indexes. Huge amounts of data were flying around their network
trying to run this task, but it was slower than expected.

An analysis of the problem showed that if you segmented the data in some
fairly simple ways, you could skip all the drama and do each slice of the task
without much fuss on a single desktop. O(n^2) isn't terrible if your data is
small. O(k+n^2) isn't much worse if you can find parallelism in your task and
spread it out easily.

I had a 4-year-old Dell consumer-level desktop to use, so I wrote the code and
ran the task. Using not much more than Perl and SQLite, I was able to compute
a large-ish slice of a few GB in a couple of hours. Some analysis of my code
showed I could actually perform the analysis on insert into the DB, and that
the data was small enough to fit into memory, so I set SQLite to :memory: and
finished in 30 minutes or so. That problem solved, the rest was pretty
embarrassingly parallel, and in short order we had a dozen of these spare
desktops running the same code on different data slices, finishing the task 2
orders of magnitude faster than their previous approach.
Some more coordinating code and the system was fully automated. A single
budget machine was theoretically now capable of doing the entire task in 2
months of sustained compute time. A dozen budget machines finished it all in a
week and a half. Their original estimate on their old distributed approach was
6-8 months with a warehouse full of machines, most of which would have been
computing things that resulted in a bunch of nothing.
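
A minimal sketch of that pattern (not the actual code; table and field names
are made up), assuming Perl with DBD::SQLite:

    use strict;
    use warnings;
    use DBI;    # with DBD::SQLite installed

    # One slice of the pairwise job goes into an in-memory SQLite DB.
    my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do("CREATE TABLE items (id INTEGER PRIMARY KEY, grp TEXT, val REAL)");
    $dbh->do("CREATE INDEX idx_grp ON items (grp)");

    my $ins = $dbh->prepare("INSERT INTO items (grp, val) VALUES (?, ?)");
    while (my $line = <STDIN>) {        # one pre-segmented slice on stdin
        chomp $line;
        $ins->execute(split /\t/, $line);
    }
    $dbh->commit;

    # The O(n^2) comparison, but only within a segment: a self-join on
    # the segmentation key is what keeps n small enough for a desktop.
    my $pairs = $dbh->selectall_arrayref(
        "SELECT a.id, b.id, abs(a.val - b.val)
           FROM items a JOIN items b ON a.grp = b.grp AND a.id < b.id");
    print scalar(@$pairs), " comparisons\n";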

To my knowledge they still use a version of the original Perl code with SQLite
running in memory without complaint. They could speed things up more with a
better in-memory system and a quick code port, but why bother? It's completing
the task faster than they can feed it data as the data set is only growing a
few GB a day. Easily enough for a single machine to handle.

\- Another group was struggling with handling a large semantic graph and
performing a specific kind of query on the graph while walking it. It was ~100
million entities, but they needed interactive-speed query returns. They had
built some kind of distributed Titan cluster (obviously a premature
optimization).

Solution: convert the graph to an adjacency matrix, stuff it in a PostgreSQL
table, build some indexes, and rework the problem as a clever dynamically
generated SQL query (again, Perl). Now they were seeing 0.01-second returns,
fast enough for interactivity. Bonus: the dataset at 100m rows was tiny, only
about 5GB; with PostgreSQL's maximum table size of 32TB and disk space cheap,
they were set for the conceivable future. Now administration was easy,
performance could be trivially improved with an SSD and some RAM, and they
could trivially scale to a point where dealing with Titan was far in their
future.
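
Roughly the shape of the trick, as a hypothetical sketch (assuming Perl with
DBD::Pg; the real schema and query generation were surely more involved):

    use strict;
    use warnings;
    use DBI;    # with DBD::Pg installed

    my $dbh = DBI->connect("dbi:Pg:dbname=graph", "", "", { RaiseError => 1 });

    # One-time setup, roughly:
    #   CREATE TABLE edge (src BIGINT, dst BIGINT);
    #   CREATE INDEX edge_src ON edge (src);
    #   CREATE INDEX edge_dst ON edge (dst);

    # Unroll a fixed-depth walk into a dynamically generated self-join.
    sub walk_query {
        my ($depth) = @_;
        my $sql = "SELECT e1.src, e${depth}.dst FROM edge e1";
        for my $i (2 .. $depth) {
            my ($prev, $cur) = ("e" . ($i - 1), "e$i");
            $sql .= " JOIN edge $cur ON $cur.src = $prev.dst";
        }
        return $sql . " WHERE e1.src = ?";
    }

    # Everything reachable in exactly 3 hops from node 42.
    my $rows = $dbh->selectall_arrayref(walk_query(3), undef, 42);

With both indexes in place, each hop is an index lookup, which is how you get
interactive-speed returns out of a plain relational table.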

Plus, there's a chance for PostgreSQL to start supporting proper scalability
soon, putting that day even further off.

\- Finally, an e-commerce company I worked with was building a dashboard
reporting system that ran every night and took all of their sales data and
generated various kinds of reports, by SKU, by certain number of days in the
past, etc. It was taking 10 hours to run on a 4 machine cluster.

A dive into the code showed that they were storing the data in a deeply nested
data structure for computation, and building and destroying that structure as
the computation progressed was taking all the time. Furthermore, some metrics
on the reports showed that the most expensive-to-compute reports were simply
not being used, or were viewed only once a quarter or once a year around the
fiscal year. And of the cheap-to-compute reports, millions were being
pre-computed but only a small percentage were actually viewed.

The data structure was built on dictionaries pointing to other dictionaries
and so on. A quick swap to arrays pointing to arrays (plus some
dictionary<->index conversion functions so we didn't blow up the internal
logic) transformed the entire thing. Instead of 10 hours, it ran in about 30
minutes, on a single machine. Where memory had been running out and crashing
the system, it now never went above 20% utilization. It turns out allocating
and deallocating RAM actually takes time, and switching to a smaller, simpler
data structure makes things faster.
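
The swap looked roughly like this (a simplified sketch, not the production
code, with a made-up input format):

    use strict;
    use warnings;

    # Intern string keys to dense integers once, then use arrays of
    # arrays instead of dicts of dicts in the hot loop.
    # Assumed input format: "SKU<TAB>day-index<TAB>amount".
    my (%sku_idx, @sku_name);
    sub sku_to_idx {
        my ($sku) = @_;
        $sku_idx{$sku} //= do { push @sku_name, $sku; $#sku_name };
        return $sku_idx{$sku};
    }
    sub idx_to_sku { $sku_name[ $_[0] ] }

    my @sales;                      # $sales[$sku][$day] = running total
    while (my $line = <STDIN>) {
        chomp $line;
        my ($sku, $day, $amount) = split /\t/, $line;
        $sales[ sku_to_idx($sku) ][$day] += $amount;
    }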

We changed some of the cheap to compute reports from being pre-computed to
being compute-on-demand, which further removed stuff that needed to run at
night. And then the infrequent reports were put on a quarterly and yearly
schedule so they only ran right before they were needed instead of every
night. This improved performance even further and as far as I know, 10 years
later, even with huge increases in data volume, they never even had to touch
the code or change the ancient hardware it was running on.

Seeing these problems in retrospect, it seems insane that racks in a data
center, or entire data centers, were ever seriously considered the way to make
them solvable. A single machine's worth of hardware today is almost
embarrassingly powerful. Here's a machine that for $1k can break 11 _TFLOPS_
[1]. That's insane.

It also turns out that most of our problems are not compute-speed problems.
Throwing more CPUs at a problem doesn't really improve things; disk and memory
are the bottleneck. Why anybody would think shuttling data over a network to
other nodes, where we then exacerbate every I/O problem, would improve things
is beyond me. Getting data across a network and into a CPU that's sitting idle
99% of the time is not going to improve your performance.

Analyze your problem, walk through it, figure out where the bottlenecks are
and fix those. It's likely you won't have to scale to many machines for most
problems.

I'm almost thinking of coming up with a statement: Bane's rule, you don't
understand a distributed computing problem until you can get it to fit on a
single machine first.

1 -
[http://www.freezepage.com/1420850340WGSMHXRBLE](http://www.freezepage.com/1420850340WGSMHXRBLE)

~~~
NhanH
I'd second jacquesm's post: you should write a book or a blog post on
those.

One of the reasons I've noticed for everyone wanting to use "distributed" and
"cluster" stuff is that we have no intuition or experience for how much can be
processed within the limits of a single machine. When someone starts designing
a data pipeline, even if they know how big the dataset is (in terms of
GBs/TBs/whatever the criterion is), they still don't know whether it can fit
on one machine. So the safe solution is to design a distributed system: if it
doesn't fit, you throw more hardware at it. It's somewhat similar to "no one
ever got fired for buying IBM".

~~~
jerf
We probably need to collectively start doing the same thing that people do
when they ask "How can I speed up my code?"... "Profile, profile, profile."

"How can I make my code cloud-scale?" "Profile, profile, profile." First make
it fast. Not hyper-ultra fast, optimized to within an inch of its life with
embedded ASM and crazy data structures, but as far as you can get it while
still writing simple, sensible code. Then, only if you have a problem do you
even worry about whether it should go cloud-scale, and by the time you're done
with this you'll probably already have a good idea how to partition the
problem better because you learned a lot more about it while profiling.
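
To make that concrete in the Perl that comes up elsewhere in this thread:
Devel::NYTProf is a solid profiler, and even the core Benchmark module tells
you a lot before you reach for a cluster. A toy comparison with made-up data:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # Made-up data: 100k (sku, day, amount) rows.
    my @rows = map { [ "sku" . int(rand 1000), int(rand 365), rand 100 ] }
               1 .. 100_000;

    cmpthese(-3, {
        nested => sub {                  # dict-of-dicts aggregation
            my %t;
            $t{ $_->[0] }{ $_->[1] } += $_->[2] for @rows;
        },
        flat => sub {                    # single flat-keyed dict
            my %t;
            $t{ "$_->[0]\t$_->[1]" } += $_->[2] for @rows;
        },
    });

Measure first; the partitioning scheme usually falls out of whatever the
profile says is actually slow.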

------
taeric
This raises so many questions that I can't even really take them all in, and
that worries me. Please, if there are well-formed responses or dialogues that
form based on this, somebody make sure to link them back. Very, very
interesting read.

~~~
arjunnarayan
The basic point is: you think your dataset is big, but it probably isn't. A
well-written C program running on a modern laptop can handle a graph dataset
with a few billion nodes just fine. Before paying for 128 cores running a
"scalable" implementation, first check your dataset size. Is it in the
terabytes? If not, you're probably fine running any computations on a single
core.

Google and Facebook need cluster computing solutions like MapReduce and
Spanner. You most likely don't, and are years away from needing one. Even a
successful startup like SnapChat or Tinder could probably do all their
analytics/graph processing needs on a single core. So worry about the other
things first.

~~~
herdrick
In the article he's dealing with a graph of billions of edges - probably not
billions of nodes. That's been my experience, too, with loading a graph in
memory on a laptop: billions of edges and hundreds of millions of nodes is
fine but getting close to being too much.

~~~
taeric
Aren't most of these algorithms limited by the edges?

------
xtacy
While I understand the sentiment behind this post, I think it misses one
crucial point: It costs time, effort, and very smart people to build the
"Bugati"-like system as they describe, instead of the current systems (that
are more like "Toyotas", to name one).

I haven't seen the paper yet, so I can't be sure, but I think the numbers
might ignore many factors. First, you need some kind of abstract, exchangeable
storage format (e.g., protobufs) to work with the data in many languages.
Second, there's the file system and all its intricacies. Third, it's unlikely
that any compute environment will be dedicated to only one application
(there's scheduling, resource management, and all that, which means there are
hidden costs to doing network IO due to contention, protocol quirks, etc.).
And finally, any realistic application is more than just "solving" the problem
in the fastest way possible: requirements change all the time, new features
will be added, and the code needs to be readable, understandable,
maintainable, etc.

It's possible to do all of the above AND be super efficient, but it requires
such a tremendous understanding of the system at all levels that it can be
quite challenging, and frankly, given business requirements, it's probably not
worth the time. If there's a framework that gives you abstraction but compiles
to the fastest possible specific implementation AND makes a programmer
productive, I would love to read up more!

~~~
ploxiln
I've worked at a place that had a good excuse for using a real "big data"
processing system (a large Hadoop cluster on bare metal), because the dataset
was way bigger than would fit on a single machine. The cluster had over 20
machines with 10 or so 3TB disks each, and a large computation might use half
the dataset. Even if you could put all the data in a storage appliance and
read it from a single machine, you needed the separate memories and data
busses just to sift through the data and keep the relevant bits in non-swapped
memory in less than a day.

I think the important lesson from this paper, which a few researchers also
learned from our cluster, is that the amount of inefficiency that can live in
software, and then be removed by competent programming, is astronomical these
days. You see many arguments like yours (programmer time is expensive, you
need to be a super expensive expert, just pay for the systems, blah blah) that
underestimate the cost of the ridiculous inefficiency and overestimate the
cost of competency.

Even scientists inexperienced with serious programming can often get
appreciably better at writing their data processing jobs _before their first
job finishes_ when dealing with really big data. A lot of what people call big
data isn't even big data. They'd rather go through the motions of setting up a
big-data processing system and using it poorly than learn better software
engineering skills, even if that would take less of their (and others') time,
amortized over the next few months of their work.

This isn't Toyota vs. Bugatti. This is... a freight train with conductor,
engineer, station staff, and one car of payload... vs. getting a driver's
license for a large van.

------
erdevs
Perhaps another takeaway from this article is that it'd be nice if more
research papers benchmarked with vastly larger data sets.

Many researchers would love to do just that, of course. However, as many
researchers will lament, it's not always easy to get data sets with 10-figure
node counts and 11-figure edge counts appropriate to the space being explored.

The best research benchmarks I see do compare against a single-machine (often
multithreaded or multiprocess) implementation. And they also show benchmark
results on datasets of increasing size.

I agree with the authors that those sorts of papers aren't common enough,
though. And we should strive to do better. Moreover, I agree that in practice
in industry, many people over-optimize for horizontal scalability early,
and/or do not realize potential savings and benefit by doing vertical
optimizations after gaining initial scale.

------
chimtim
If the dataset and/or computation fits in your laptop, why would you use a
cluster framework?

If you want to use multiple cores, why would you not use the pthread library
instead of Spark, GraphX, etc.? The authors never show a pthread
comparison.

The article shows just one algorithm on toy datasets. For certain algorithms,
such as stochastic gradient descent, multiple cores can process data in
parallel and go through it very quickly. Again, if everything fits in memory,
doing gradient descent on the entire dataset will be much faster and give a
better-quality result. This is pretty well known to end-users, i.e., folks who
actually try to solve big-data problems.
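
As a toy illustration of the in-memory, full-batch point (plain Perl,
made-up numbers): each pass over the whole dataset yields one exact gradient
step, no sampling noise.

    use strict;
    use warnings;

    # Full-batch gradient descent for 1-D linear regression.
    my @x = (1, 2, 3, 4, 5);
    my @y = (2.1, 3.9, 6.2, 8.1, 9.8);    # roughly y = 2x
    my ($w, $b, $lr) = (0, 0, 0.01);

    for (1 .. 2000) {
        my ($gw, $gb) = (0, 0);
        for my $i (0 .. $#x) {
            my $err = ($w * $x[$i] + $b) - $y[$i];
            $gw += $err * $x[$i];
            $gb += $err;
        }
        $w -= $lr * $gw / @x;             # one exact step per full pass
        $b -= $lr * $gb / @x;
    }
    printf "w=%.3f b=%.3f\n", $w, $b;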

However, most papers use small datasets like Twitter or MNIST because their
convergence behavior is well understood (rather than to demonstrate scaling).

------
wrexsoule
> In many cases, you’d be better off running the same computation on your
> laptop.

Stopped reading there. If you're better off running the same computation on
your laptop, you're just not dealing with real big data and so you don't need
any distributed systems. Simple as that.

~~~
scott_s
That's his whole point. But he goes further, showing via experiment that a lot
of distributed systems are benchmarking problem sizes that can be solved
faster in a single thread.

