If you don't want to read everything (but you should, before trusting the conclusions), the relevant measurements are here:
Edit: Reading the paper, I see they cite these measurements and present their work as "only [..] 8.8× slower than a distributed in-memory" system (GraM, a different system from timely dataflow). The quote is on the first page.
People might also want distributed systems for other reasons - fault tolerance is one. So as your SLA requirements get stricter, you sometimes end up with little choice but to pay that cost. Other times you don't.
They do cite challenges with scaling this up, but I think quite a bit of their innovation is around just leveraging hardware better, which is a big enough deal on its own.
EDIT: Looks like this is already a thing 
A single machine is only faster on problems that cannot be parallelized well.
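To make "parallelized well" concrete, here's a toy Amdahl-style model (every number below is made up, just to show the shape of the trade-off): the parallelizable part of the work shrinks with node count, but coordination cost grows with it.

```python
# Toy model of the claim above (all numbers assumed): Amdahl's law plus a
# per-node coordination cost that grows with cluster size.

def cluster_time(t_single, n, serial_frac, per_node_cost):
    serial   = serial_frac * t_single            # part that can't parallelize
    parallel = (1 - serial_frac) * t_single / n  # part that can
    return serial + parallel + per_node_cost * n

T = 100.0  # seconds on a single machine (assumed)
for p in (0.05, 0.9):  # well- vs poorly-parallelizable workload
    row = ", ".join(f"n={n}: {cluster_time(T, n, p, 1.0):6.1f}s" for n in (8, 64))
    print(f"serial fraction {p}: {row}  (single machine: {T:.0f}s)")
```

With a small serial fraction the cluster wins easily; with a large one, adding nodes barely helps before the coordination cost dominates, and that's the regime where the single machine stays ahead.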
Is this true in the general case, or what?
^ (Pregel is Google's batch graph processing framework)
FWIW, most graph computations can fit on a single powerful machine these days. There really aren't many that require distributed processing.
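A rough sanity check on that (sizes assumed purely for illustration): even a multi-billion-edge graph fits in the RAM of one big server if you store the edges compactly.

```python
# Back-of-the-envelope edge-list footprint (illustrative numbers only).
def edge_list_gib(num_edges, bytes_per_id=4):
    return num_edges * 2 * bytes_per_id / 2**30  # (src, dst) pairs of 32-bit IDs

for edges in (1_000_000_000, 10_000_000_000):
    print(f"{edges:>14,} edges -> {edge_list_gib(edges):6.1f} GiB")
```

That's about 7.5 GiB for a billion edges and about 75 GiB for ten billion, both well within what a single large server holds in RAM.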
If you're trying to divide up the work by dividing the graph into parts that run on different nodes, where do you cut it? Even finding a good cut is a graph algorithm in itself. And any information that flows across that cut has to be serialized and deserialized, which is slow compared to a local memory access.
Meanwhile, a single-node algorithm is usually reading data that is already in RAM, or even in cache.
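To see the scale of the problem, here's a sketch (graph and partitioning scheme are made up): hash-partition the vertices across k nodes and count how many edges end up with their endpoints on different machines - each such edge implies serialized network traffic in a distributed run.

```python
# Sketch (assumed setup): count cut edges under a naive hash partition.
import random

def cut_edges(edges, k):
    part = lambda v: hash(v) % k  # naive hash partitioning (assumed scheme)
    return sum(part(u) != part(v) for u, v in edges)

random.seed(0)
n_vertices, n_edges, k = 10_000, 100_000, 8
edges = [(random.randrange(n_vertices), random.randrange(n_vertices))
         for _ in range(n_edges)]
print(f"{cut_edges(edges, k) / n_edges:.0%} of edges cross machine boundaries")
```

For a random graph and a random partition, roughly (k-1)/k of the edges cross the cut - nearly all of them at k=8. Real partitioners (METIS and friends) do much better, but finding that better cut is itself the graph problem the comment mentions.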
Can the title be edited?