
Jepsen: Dgraph 1.0.2 - aphyr
https://jepsen.io/analyses/dgraph-1-0-2
======
mrjn
Author of [https://dgraph.io](https://dgraph.io) here. I'd say this is one of
the best collaborations we've done. Due to Jepsen tests, we were able to
proactively identify and fix issues in a complex real time distributed graph
DB, that is Dgraph. Dgraph's graph sharding, replication and transaction
implementation is unique and it took more work to make it all work correctly,
while still achieving good performance, and without utilizing any third party
system.

I'd like to thank Kyle for his work on Dgraph. We'll continue to expand Jepsen
tests to try and identify more issues and fix the remaining ones.

I'm around if you have questions! Cheers, Manish

~~~
jimktrains2
> Dgraph has made significant progress, but 4 of the 23 issues we identified
> remain unresolved, including the corruption of data in healthy clusters.
> This work was funded by Dgraph,

Between this and your comment above, it restores a little faith in humanity
that there are people out there who honestly want to make their work as good
as possible and not paper over cracks.

~~~
mrjn
Thank you! This is the best acknowledgment of our work I've seen so far. A lot
of blood, sweat, hairs, frustrations, and tears have gone into building Dgraph
-- and this comment by you makes me feel proud of what we're building, and the
community of people using it.

~~~
beagledude
I'll echo that as well. You all are doing it right!

------
jdcarter
Does anybody else miss the style of Aphyr's older posts, with animated Barbie
dolls and such mocking the product under scrutiny? Examples:

[https://aphyr.com/posts/317-jepsen-elasticsearch](https://aphyr.com/posts/317-jepsen-elasticsearch)

[https://aphyr.com/posts/284-jepsen-mongodb](https://aphyr.com/posts/284-jepsen-mongodb)

I understand this analysis was paid for by Dgraph so GIFs and memes aren't
appropriate... but I do miss them.

~~~
aphyr
Oh, I certainly took no shortage of flak for those early analyses, but I'm
glad you appreciated them! Some of my clients have even expressed
disappointment that my jepsen.io analyses for them weren't as snarky as the
aphyr.com ones! It's a tough situation though--if I were to write snarky
analyses, even for unpaid clients, it would raise the possibility that clients
could pay for favorable treatment. So... I go through an extensive process,
often involving weeks of review with my client and peers, to produce as
accurate, professional, and neutral a report as I can.

There's another aspect to this: now that I've established a reputation and
people take my work seriously, I don't have to be snarky or aggressive to get
attention drawn to these issues. And the DB landscape has changed a lot in the
last five years! Engineers _want_ to provide formalized safety guarantees.
Vendors are taking these failure modes seriously instead of dismissing them as
irrelevant! Since I'm getting paid, I can invest more time into making these
analyses real collaborations with the vendors. So I see my role now as more
about helping people reach their safety goals, and a little less about shaming
folks for making systems that didn't live up to their claims.

My other motivations--letting users know how to work with the databases
they've chosen, and giving case studies to help other engineers test and
improve their own systems, well, those are unchanged. :)

~~~
kjeetgill
Thank you for all of your work. I dove into those posts at the start of my
career and they've shaped how I think about my work tremendously. (Along with
Brendan Gregg's work!)

I still reread this and pass it around at least once or twice a year:
[https://aphyr.com/posts/313-strong-consistency-models](https://aphyr.com/posts/313-strong-consistency-models)

It's dense and takes some time to digest but I consider it required reading
for anyone working on or with distributed systems.

~~~
aphyr
Aw, thank you! :)

------
BMarkmann
I honestly don't know that there are any posts I get more excited about on HN
than Jepsen analyses. The number of landmines distributed systems create and
the ability to suss them out in such detail / depth is really incredible.
Kudos.

~~~
amelius
Agree. But ... I sure hope that the new way of developing distributed systems
doesn't follow this pattern:

1\. write some code

2\. submit to Jepsen

3\. if errors, goto 1

The problem with this approach is obvious.

~~~
namibj
I'd hope they find themselves one or more people with a master's in math who
specialize in computer-assisted proofs, and actually prove that their code,
at least if compiled to LLVM bytecode, adheres to the high-level
architectural proof. Considering that they'd only need to prove consistency,
nothing more, and especially not every little piece of code they are writing,
but just the core that handles transactional isolation, not even that the code
actually resolves in all cases (being stranded with a stuck cluster that is
made to handle a hard reboot without problems is better than a cluster that
silently violates constraints), it should not be an infeasible goal.

The main issue is that distributed database consistency without trivial,
performance-killing locking schemes is too complex to prove when writing or
using any trivial, local methods based on e.g. SMT solvers or so.

If something like cockroachdb would be proved-consistent on that level, it
would be used for applications currently employing pessimistic locking due to
a lack of trust in their database, or scaling vertically without really
needing to (there are cases which make horizontal scaling cost-prohibitive due
to the dependency chains in the algorithms that can solve them, but they can
be replaced most of the time).

~~~
aphyr
_it should not be an infeasible goal_

There's been a lot of work on both of these problems, but right now, proving
concurrent algorithms correct, and proving equivalency of those algorithms to
executable code, are very much open research problems.

------
makmanalp
Slightly tangential but dang, this is the best graph and accompanying
explanations of consistency models I've ever seen:
[https://jepsen.io/consistency](https://jepsen.io/consistency)

------
staticassertion
I've been using dgraph for a side project. I chose it because I have a really
high performance requirement and I find the query language feels 'right'.

Glad to see the progress made, and bugs found and fixed, for these Jepsen
tests.

As a side note, interesting to see a largely Go-based, real-world project
plagued with race conditions. Quite a few bugs read "but it was done in a
goroutine" or "but mutable capturing in a goroutine", and even a segfault.
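
For anyone unfamiliar with the "mutable capturing in a goroutine" class of bug, here is a minimal sketch. It is a made-up illustration, not code from Dgraph, and `collectRacy` and `collectFixed` are hypothetical names. The racy version lets every goroutine read one variable that the loop keeps overwriting; the fixed version hands each goroutine its own copy.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collectRacy is the buggy pattern (hypothetical example). All goroutines
// close over the single variable `cur`, which the loop reassigns without
// synchronization. That is a data race: items may be duplicated or lost.
// Calling it in a program run with `go run -race` should get it flagged.
func collectRacy(items []string) []string {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out []string
		cur string // declared once, shared by every goroutine
	)
	for _, it := range items {
		cur = it // unsynchronized write
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			out = append(out, cur) // reads whatever cur holds right now
			mu.Unlock()
		}()
	}
	wg.Wait()
	return out
}

// collectFixed passes the item as an argument, so each goroutine works on
// its own copy and the result no longer depends on scheduling.
func collectFixed(items []string) []string {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		out []string
	)
	for _, it := range items {
		wg.Add(1)
		go func(item string) {
			defer wg.Done()
			mu.Lock()
			out = append(out, item)
			mu.Unlock()
		}(it)
	}
	wg.Wait()
	sort.Strings(out) // goroutine order is arbitrary, so sort for stable output
	return out
}

func main() {
	// Only the fixed version is called here; collectRacy is kept for comparison.
	fmt.Println(collectFixed([]string{"b", "a", "c"})) // prints "[a b c]"
}
```

The mutex in `collectRacy` protects the slice but not `cur`, which is why this kind of bug is easy to miss in review.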

~~~
baq
Go probably is the easiest language to get something working with any kind of
performance, but it doesn't help much to get something working correctly.
After dabbling with Rust I'm amazed that multithreaded programs work at all...

------
gazarsgo
For each Jepsen release my favorite thing is the code for the tests themselves
-- [https://github.com/jepsen-io/jepsen/tree/master/dgraph/src/j...](https://github.com/jepsen-io/jepsen/tree/master/dgraph/src/jepsen/dgraph)
-- which seems surprisingly unreferenced in the post itself.

~~~
aphyr
Links to the code are throughout the "test design" sections of each report. :)

------
drej
FYI: There's a very recent podcast with Manish, the author of Dgraph,
[https://www.dataengineeringpodcast.com/dgraph-with-manish-ja...](https://www.dataengineeringpodcast.com/dgraph-with-manish-jain-episode-44/)

------
antpls
Was a set of testing tools used to build the analysis? It would be nice to be
able to reproduce the results.

------
Sir_Cmpwn
Important: contrary to the marketing material, dgraph is _not_ open source -
it uses the Commons Clause we've been discussing over the past few days.
Discussion here:

[https://github.com/dgraph-io/dgraph/issues/2545](https://github.com/dgraph-io/dgraph/issues/2545)

~~~
tunesmith
When did the definition of "open source" change? I always took "open source"
to be an umbrella term that contained both Apache-type licenses and GPL/FSF-
type licenses. I get that Commons Clause is more restrictive in ways than GPL,
but GPL is definitely open source. So what makes Commons Clause not open
source, if GPL is?

~~~
Sir_Cmpwn
The definition of "open source" never changed. Apache and the GPL are both
open source licenses. Appending the Commons Clause removes the zeroth freedom
(FSF lingo) or the first clause (OSI lingo).

[https://opensource.org/osd](https://opensource.org/osd)

[https://www.gnu.org/philosophy/free-sw.en.html](https://www.gnu.org/philosophy/free-sw.en.html)

~~~
eecc
Which in today’s world where “we were planning to fund our lives offering
professional services around a production deployment” is kindly railroaded by
rock bottom AWS prices is a-ok...

That’s the server side version of the anti-tivo clause that made the GPLv3.

I see a v4 coming

~~~
Sir_Cmpwn
Withholding judgement on whether or not using a non-open license is the
appropriate choice for Dgraph, the problem persists that their language is
deliberately misleading. Also, GPLv3 already prevents you from tacking on the
Commons Clause; any software using any GPL-family license cannot include the
Commons Clause. Finally, AGPL was explicitly designed years ago to address the
case you raise, i.e. the server-side Tivo problem.

~~~
sanderjd
I think calling their language "deliberately misleading" is deliberately
misleading when there is a representative from the project here telling you
that they do not intend to mislead (thus even if it is misleading, it isn't
_deliberate_ ). But I guess you just seem comfortable calling that person a
liar. From reading this one thread, it seems like a bummer to me that you're
assuming bad faith. But I had never heard of Dgraph until this thread, so
maybe there is history that I'm naively ignoring.

~~~
Sir_Cmpwn
Yes, I'm willing to be frank and state explicitly that I'm calling this person
a liar.

~~~
sanderjd
What's your history with them that leads you to this conclusion? If you don't
have any, and you're simply making assumptions based on their choice of
licensing terms, what would your argument look like if you instead imagined
them as people acting in good faith, trying to provide software to people on
terms that allow them to both continue improving the software while also
providing for their families?

I guess the reason I have trouble assuming bad faith with things like this is
because it would be much easier for these people to just get high paying jobs
at big software companies than to try to eke out a living making a useful
source-provided (it would be so much easier to just use "open source" here,
but that is verboten) product while arguing with people like you.

~~~
Sir_Cmpwn
I think I've explained myself quite well in the thread. My personal,
subjective conclusion is of dishonesty. I cannot give you an objective
measure, the evidence is in this thread and you can draw your own conclusions.
Being a generally good person who is trying to put food on the table for their
family is not mutually exclusive with lying and resorting to questionably
ethical means of doing so.

~~~
sanderjd
I've (unfortunately) read the whole thread. You've jumped to an uncharitable
conclusion. A bunch of people have pointed this out. You can disagree while
assuming good faith. It's too bad you aren't doing that here. I agree with you
that it isn't a very good choice for licensing terms.

~~~
Sir_Cmpwn
>You can disagree while assuming good faith

Yes, but bad faith exists, and the assumption of good faith is not always
appropriate. I don't think that Dgraph is operating in good faith, and the
evidence is enough that I am unable to suspend my disbelief. If I were Dgraph
and my actions were so severely wrong that people were assuming bad faith, I
would want to quickly correct that. But in fact, most of the concerns have
gone unanswered.

------
rdxm
gotta say, if you do distributed systems, subject yourself to Jepsen analysis
and publish it, you win a truck-load of points in my book.

------
steven1982
Dgraph released 1.0.6 and replication stopped working, so I gave up on dgraph.
It's just unreliable and immature. That's expected, since there has been only a
single developer making commits for months.

------
steven1982
Dgraph has been bragging about being used by big companies. The Jepsen test
clearly shows that there were critical bugs, and no sane company can use it in
production as a primary database, unless they don't care about their data.

~~~
vemv
Same could be said about almost every system tested by Jepsen, so your
observation seems unfair.

