
COST in the land of databases - jchanimal
https://github.com/frankmcsherry/blog/blob/master/posts/2017-09-23.md
======
jandrewrogers
Tangentially, graph platforms have _long_ had the issue that performance
claims are astonishingly unreliable and come with so many caveats and
qualifications (documented or not) that the claims don't generalize usefully
in practice. One of the reasons I got out of the graph database business was
the pervasive lack of rigor and repeatability when it comes to measuring
scalability and performance across platforms; there is a cost asymmetry to
properly addressing the volume of questionable performance measurements made
in the graph database world.

In fairness, graph workload performance, compared to other types of workloads,
is unusually sensitive to a diverse set of details that are rarely captured
adequately by researchers. As often as not, you are actually measuring things
unrelated to the graph algorithm itself. Measuring graph databases properly
would require a level of specification for how to measure such things that
hasn't happened yet. And many people like it that way: it makes it relatively
easy to contrive details that make your particular algorithm or implementation
look good, despite the reality that serious weaknesses are pervasive across
graph database platforms. A generally, objectively strong graph database
platform doesn't exist AFAIK -- they all kind of suck in unique ways.

~~~
eternalban
A graph is a mathematical object. It has well defined properties, such as
'degree', 'radius', etc.

Just as in e.g. relational domain you can propose well defined operations
(such as join two relations based on a common foreign key) and devise
benchmarks for these operations, it should be possible to do the same precise
thing for graph databases.

For a graph database, you would want (a) ingress, (b) comps, and (c) recovery
benchmarks. The (b) comps benchmarks can be based on a set of common query
patterns applied to a parametric graph spec.
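As a toy sketch of what such a setup might look like (all names here are illustrative, not taken from any existing benchmark suite): a parametric graph spec fixes the graph's shape, the ingress step loads the edges, and a comps benchmark times a well-defined operation such as computing out-degrees.

```python
import random
import time
from collections import defaultdict

def random_graph(n, m, seed=0):
    """Parametric graph spec: n nodes, m uniformly random directed edges."""
    rng = random.Random(seed)
    return [(rng.randrange(n), rng.randrange(n)) for _ in range(m)]

def degree_histogram(edges):
    """One 'comps' benchmark: out-degree of each node, a well-defined
    mathematical property of the graph."""
    deg = defaultdict(int)
    for src, _dst in edges:
        deg[src] += 1
    return deg

edges = random_graph(n=1_000, m=10_000)   # (a) ingress
t0 = time.perf_counter()
deg = degree_histogram(edges)             # (b) comps
elapsed = time.perf_counter() - t0
print(f"{len(edges)} edges, max out-degree {max(deg.values())}, {elapsed:.4f}s")
```

Because the spec is parametric (n, m, seed), any two systems can be handed the exact same graph, which is the repeatability the parent comment says is missing.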

> In fairness, graph workload performance, compared to other types of
> workloads, is unusually sensitive to a diverse set of details that are
> rarely captured adequately by researchers.

That reads: we don't understand graphs.

------
mjb
That was a good read, but I think the author's conclusion is a little bit too
punchy (although it did make it a fun read).

Figuring out how to scale-out non-trivial problems is difficult. Many of these
scale-out solutions are going to be worse than local processing, and many
won't add meaningful value at any scale. However, there's real value in
approaching difficult problems, and there's real value in "the fastest way to
do X in a cluster" even if it's not "the fastest way to do X". That's because
scale-out is a real requirement for a lot of interesting things, and because if
we can find better ways to scale out that start to compete with single-machine
solutions (and "custom cluster" and "supercomputer" solutions), then we have
made real progress toward something that's faster at all scales. It's a
valuable research goal.

Now, it would be great if the database community (and the systems research
community at large) were more explicit about that. They should set the standard
that authors need to clearly differentiate between "improved on result X under
constraints Y" and "moved the state-of-the-art forward under all constraints".

It's also good to read systems research skeptically, and think about whether
you really need to pay the cost of scale-out. For companies whose output isn't
primarily "research", it makes a lot of sense to constantly look for "the
cheapest way to do X", not "the cheapest way to do X in a cluster".

~~~
lambda
Note that the author himself is working on a framework for distributed
differential dataflow computations (dataflow computations in which it is very
cheap to make small updates to the inputs and see the differences in the
outputs), and he has plans to apply the same metrics (comparing to a single-
threaded implementation) to his own work once he gets a little further:

[https://github.com/frankmcsherry/blog/blob/master/posts/2017...](https://github.com/frankmcsherry/blog/blob/master/posts/2017-03-28.md)

I think that answering the "fastest way to do X in a cluster" question is only
particularly interesting if X is something that you need to do in a cluster.
But clearly they are testing an X that can be done on a mid-range
laptop. There's a real question of whether the X that they are testing will
actually scale to larger datasets than the laptop (or a decently beefy single
node) could handle; it's entirely possible that the cluster solution will blow
up at that point. And of course, if the single-threaded implementation is
faster, then you could presumably just do "X in a cluster" by doing "X on a
single node" and using the rest of the nodes to mine your CPU bound
cryptocurrency of choice.
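For context on what those single-threaded baselines look like: the COST paper's reference implementations are only a few dozen lines. Here is a toy Python sketch (not the paper's actual Rust code) of the label-propagation connected-components style of baseline the metric compares against:

```python
def connected_components(n, edges):
    """Single-threaded label propagation over an edge list:
    repeatedly push the smaller label across each edge until no
    label changes, so each component ends up labeled by its minimum node."""
    label = list(range(n))
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            lo = min(label[u], label[v])
            if label[u] != lo or label[v] != lo:
                label[u] = label[v] = lo
                changed = True
    return label

print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3]
```

The point of the COST metric is that a distributed system's configuration only "outperforms" once it beats something this simple running on one core.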

~~~
kbenson
> And of course, if the single-threaded implementation is faster, then you
> could presumably just do "X in a cluster" by doing "X on a single node" and
> using the rest of the nodes to mine your CPU bound cryptocurrency of choice.

As the author himself points out, amusingly.

 _These numbers are for sure more interesting. SEED does better than any
number I can actually achieve on my laptop. That demonstrates non-triviality,
in that their system isn't just a laptop computing the answer and 156 cores
mining bitcoin._

I really enjoyed the snark in this, and the series of blog posts he links in
the beginning. I tend to be nice to a fault myself unless someone has shown
themselves to not be acting in kind (and even then I confirm), but I admit I
take quite a bit of voyeuristic pleasure in people being assholes in smart and
interesting ways. Vicariously living and all that, etc...

------
mcguire
Haven't read the article yet, but these slides (from HotOS 2015) are
beautiful: [PDF] [https://www.usenix.org/sites/default/files/conference/protec...](https://www.usenix.org/sites/default/files/conference/protected-files/hotos15_slides_mcsherry.pdf)

That article is [https://www.usenix.org/conference/hotos15/workshop-program/p...](https://www.usenix.org/conference/hotos15/workshop-program/presentation/mcsherry)

------
dswalter
For anyone who is just passing by: the article attempts to verify claims
from recent VLDB and SIGMOD papers of decent* performance on graph
algorithms in RDBMSs. It's worth reading not only for the technical merits of
the discussion, but also because the style of the article is ...feisty.

*for a given definition of decent.

~~~
mcguire
There is one fundamental step you should always take when reading systems
papers with benchmarks: _Always look for what the authors are trying to hide._

It's (a) always there, and (b) always hilarious. This ranges from operating-
system comparisons that need "we didn't actually physically slow down the
processor" caveats to look good, to these "we look great, compared to that guy
over there licking the floor" papers.

------
harveywi
I love this quote by Paul Barham:

"You can have a second computer once you've shown you know how to use the
first one."

~~~
elvinyung
Is there actually a source for this? The COST paper uses it, but I still
haven't found what it's quoting.

------
incan1275
He has some great points about SIGMOD/VLDB reviewing - they place far too
much emphasis on "algorithmic novelty" instead of impactful simple systems.
However, for graph processing, I too think that in many instances single-
machine processing is too limiting.

In general, I think people would be more convinced if there were real cases in
industry where engineers switched from distributed graph algos to single-
machine algos because they found the performance was better. In all my
years, I haven't heard of a case like this (it's always been the other way
around).

------
aub3bhat
The author should spend some time delivering solutions to "real problems".
It's easy to be snarky while writing a blog post with toy datasets, but it's
difficult to actually deliver a production solution in an organization. The
real COST is often not the time consumed but the costs that arise from
maintaining a heterogeneous architecture and remaining legally compliant.

A database or Hadoop cluster is not just an execution engine, but rather a
manageable, reproducible execution engine. Sure, you can code up a smarter
algorithm on your personal i9 workstation. However, having something available
as part of the DB or Hadoop ecosystem enables reproducible execution while
taking care of compliance (e.g. privacy), security, access control, etc.

Eventually we will have containerized extensions to the DB query execution
flow that will allow us to have the best of both worlds, but for 99% of
practicing data scientists, downloading a gigabyte-sized subset to their own
workstation is not a viable option.

The real opportunity lies in the ability to add arbitrary compute/memory
capacity to DB query execution flows.

~~~
geofft
That's absolutely true, but I think the author's argument is likely to be that
the popularity of Spark and Hadoop in the first place is misguided, and in a
world where people were paying attention to performance instead of novelty,
we'd have something like Spark and something like Hadoop that was optimized
for a single computer, not for scaling for the fun of scaling.

~~~
aub3bhat

       scaling for the fun of scaling.
    

I think your arguments are misguided: for every compute-bound task where
Hadoop/Spark underperform, there are 1,000 other ETL-type tasks where Hadoop
is indispensable. As a result, any organization running a large Hadoop cluster
will already have underused compute capacity for free and a maintenance staff
already taking care of the cluster. Thus, from an organization's perspective,
the time difference is not material, especially for batch jobs; this is the
reason why Presto and Spark have been so successful. They enabled
underutilized Hadoop clusters to be used for ML and data science while
delivering reasonable performance for zero cost.

~~~
Terretta
> _”... Staff already taking care of_ the _cluster.”_

“The” cluster, singular ...

This is the catch. If you need security and compliance, you can’t today derive
the benefits of re-use by other teams.

Given today’s distros, and assuming your threat model needs to account for
insider threat, you need a different cluster for each data ownership grouping
and data sensitivity level. For four teams with three levels of data, you’d
need twelve completely independent clusters.

Unless, as noted above, you’ve done a ton of in-house multi-tenancy work to
provide full stack security and compliance assurances and audits.

~~~
aub3bhat
However, you may split clusters by role/access. You will have I/O-bound
ETL/interactive tasks as the primary use case. Thus the investment in security
will be constant whether or not you reuse the same infrastructure for
compute/memory-bound ML tasks.

------
fredliu
Not sure if it's related, and AWS Athena is not a database in a strict sense
(or in any sense at all, since it's not a transactional DB), but it seems to
offer both scale and lower cost if it fits your requirements (read-only, DW-
like traffic)? Does anybody have experience using it in prod as a DW
replacement?

~~~
dswalter
I would argue this question is not really related, but we use PrestoDB (the
technology AWS Athena is built using), and it is particularly great for when
data becomes too large to fit in a "regular" database. PrestoDB and Apache
Impala are the two major open-source surprisingly-fast-distributed-SQL-query-
planning-engines. If you pair either of them with a columnar data storage
format like Parquet or ORC, you can make analytical workloads work on
inconveniently-sized data. AWS Athena abstracts the cluster management for you
and runs PrestoDB on spare EC2 instances under the hood.
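The columnar-format advantage mentioned above can be illustrated with a toy sketch (pure Python, not Parquet/ORC themselves): an analytical query like a column sum only needs to scan the one column it touches, whereas row-oriented storage drags every tuple through memory.

```python
import array

# Toy "table": 1M rows with two integer columns, stored both ways.
n = 1_000_000
rows = [(i, i % 7) for i in range(n)]              # row-oriented: list of tuples
cols = {                                           # column-oriented: one compact
    "id": array.array("q", range(n)),              # array per column
    "bucket": array.array("q", (i % 7 for i in range(n))),
}

# Analytical query: SUM(bucket). The row scan touches every tuple;
# the column scan reads one contiguous array and skips "id" entirely.
row_sum = sum(r[1] for r in rows)
col_sum = sum(cols["bucket"])
assert row_sum == col_sum
```

Parquet and ORC go much further (compression, encodings, predicate pushdown), but this access-pattern difference is the core reason they pair so well with engines like Presto and Impala.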

~~~
fredliu
As mentioned in the other comment, I was actually only thinking about use
cases like Athena + CSV-on-S3, and that's the really "low cost" approach I
was referring to for analytical data.

------
microcolonel
I am overloaded with acronyms. For crying out loud can we please just say what
we mean instead of sending everyone down an acronym wormhole three whitepapers
deep?

~~~
mcguire
From this post:

" _Our first paper is Scalable Distributed Subgraph Enumeration or "SEED". We
will have a future post about rules for picking acronyms._"

I suspect the author would agree.

------
yazr
Is this guy for real?

He is limited by not having an "expensive macbook with 4 real cores"?? He
gave up on a test because it "paged out the 15GB of data".

Of course at these scales a cluster is x5 slower.

Can someone please explain - am I missing something here?

~~~
shoo
> Of course at these scales a cluster is x5 slower.

So this complaint should be levelled at the papers and the academic
communities: why is it acceptable to only provide empirical evidence for these
distributed algorithms at data scales where they cannot outperform simpler
implementations? Why not require empirical performance measurement on truly
large-scale examples for these kinds of algorithms?

~~~
wmf
Of course academics don't have access to truly big data, so this amounts to
saying that they can't publish anything. That's fine with me but I don't think
the academic CS establishment will agree.

~~~
mcguire
" _the academic CS establishment_ "

Strictly speaking, I think that term attributes more organization to the
academic CS establishment than my experience supports.

The point here is that, if they're trying to establish that Smart Technique 2
is better than Smart Technique 1, they need to first establish that Stupid
Technique Q doesn't work at all. Which they haven't.

If they don't have access to "truly big data" to do so, well, most of the
people who want to use the Smart Techniques probably don't either.

