
Jepsen: PostgreSQL 12.3 - aphyr
http://jepsen.io/analyses/postgresql-12.3
======
brandur
Personally, this kind of thing actually gives me _more_ confidence in Postgres
rather than less. The core team's responsiveness to this bug report was
incredibly impressive.

Around June 4th, the article's author comes in with a bug report that
basically says "I hammered Postgres with a whole bunch of artificial load and
made something happen" [1].

By the 8th, a preliminary patch is ready for review [2]. That includes all the
time to get the author's testing bootstrap up and running, reproduce, diagnose
the bug (which, lest us forget, is the part of all of this that is actually
hard), and assemble a fix. It's worth noting that it's no one's job per se on
the Postgres project of fix this kind of thing — the hope is that someone will
take interest, step up, and find a solution — and as unlikely as that sounds
to work in most environments, amazingly, it usually does for Postgres.

Of note to the hacker types here, Peter Geoghegan was able to track the bug
down through the use of rr [4] [5], which allowed an entire problematic run to
be captured, and then stepped through forwards _and_ backwards (the latter
being the key for not having to run the simulation over and over again) until
the problematic code was identified and a fix could be developed.

\---

[1] [https://www.postgresql.org/message-
id/CAH2-Wzm9kNAK0cbzGAvDt...](https://www.postgresql.org/message-
id/CAH2-Wzm9kNAK0cbzGAvDtdJi-rj_ngsBbRX0i_DKdjYxqJnzNA%40mail.gmail.com)

[2] [https://www.postgresql.org/message-
id/CAH2-Wzk%2BFHVJvSS9VPP...](https://www.postgresql.org/message-
id/CAH2-Wzk%2BFHVJvSS9VPPJ_K9w4xwqeVyfnkzYWtWrBzXJSJcMVQ%40mail.gmail.com)

[3] [https://www.postgresql.org/message-
id/CAH2-WznTb6-0fjW4WPzNQ...](https://www.postgresql.org/message-
id/CAH2-WznTb6-0fjW4WPzNQh4mFvBH86J7bqZpNqteVUzo8p%3D6Hg%40mail.gmail.com)

[4]
[https://en.wikipedia.org/wiki/Rr_(debugging)](https://en.wikipedia.org/wiki/Rr_\(debugging\))

[5] [https://www.postgresql.org/message-
id/CAH2-WznTb6-0fjW4WPzNQ...](https://www.postgresql.org/message-
id/CAH2-WznTb6-0fjW4WPzNQh4mFvBH86J7bqZpNqteVUzo8p%3D6Hg%40mail.gmail.com)

~~~
gen220
Thank you for this comment that gives credit where it's due, this is a very
impressive set of threads to read through.

And I agree. For me, one of the most important measures of the reliability of
a system is how that system responds too information that it might be wrong.
If the response is defensiveness, evasiveness, or persuasive in any way, i.e.
of the "it's not _that_ bad" variety, run for the hills. This, on the other
hand is technical, validating, and prompt.

Every system has bugs, but depending on these cultural features, not every
system is capable of systematically removing those bugs. With logs like these,
the pg community continues to prove capable. Kudos!

~~~
shawn-butler
What are the storage requirements for using rr for intense or longer debugging
sessions?

~~~
petergeoghegan
The exact recording in question was about 125MB, and that was after I
materialized it using "rr pack".

I'd say that the storage overhead is unlikely to be a concern in almost all
cases. It's just something that you need to keep an eye on.

------
feike
This postgresql mailing list thread allows you to read along with the
PostgreSQL developers and Jepsen, seems like a very useful discussion:
[https://www.postgresql.org/message-
id/flat/db7b729d-0226-d16...](https://www.postgresql.org/message-
id/flat/db7b729d-0226-d162-a126-8a8ab2dc4443%40jepsen.io)

~~~
aeontech
This is just such a pleasure to read, even as someone that has only surface
awareness of database internals at all. Both for the incredibly friendly and
professional tone, and for the obvious deep technical knowledge on both sides.

And that first email, my god, that should be titanium-and-gold-plated standard
of a bug report.

~~~
bloopernova
> that first email, my god, that should be titanium-and-gold-plated standard
> of a bug report.

It's a thing of beauty. It even includes versions of software used!

My daily experience with bug reports are that they 50/50 won't even include a
description, just a title. It's such a cliche already, but "project name is
broken" makes my blood boil. What environment? What were you doing? Is this
production? How do I test this bug? (from an Ops perspective) When did you
notice this? Has anything changed recently to possibly cause an error?

Arg, my blood pressure!

/offtopic, sorry.

------
arghwhat
It is very rare to see a Jepsen report that concludes with a note that a
project is being too humble about their consistency promises.

Finding effectively only a single obscure and now fixed issue where real-world
consistency did not match the promised consistency is pretty impressive.

~~~
jennyyang
The testing on Zookeeper found no problems, which was extremely impressive
given how thorough Aphyr is.

~~~
tatersolid
Strong consistency is easier when throughput and concurrency are limited.

Zookeeper is impressive in many ways. However, unless something changed
drastically in the last two years, Zookeper throughput will always limit it to
configuration/metadata/control-plane rather than a primary/data-plane use
cases.

------
sandGorgon
> _PostgreSQL has an extensive suite of hand-picked examples, called
> isolationtester, to verify concurrency safety. Moreover, independent
> testing, like Martin Kleppmann’s Hermitage has also confirmed that
> PostgreSQL’s serializable level prevents (at least some!) G2 anomalies. Why,
> then, did we immediately find G2-item with Jepsen? How has this bug
> persisted for so long?_

This is super interesting. Jepsen seems to be like Hypothesis for race
conditions: you specify the race condition to be triggered and it generates
tests to simulate it.

Yesterday, Gitlab acquired a fuzz testing company[1]. I wonder if Jepsen was
envisioned as a full CI integrated testing system

[1]
[https://m.calcalistech.com/Article.aspx?guid=3832552](https://m.calcalistech.com/Article.aspx?guid=3832552)

~~~
aphyr
Yes. Jepsen and Hypothesis both descend from a long line of property-based
testing systems--mostly notably, Haskell & Erlang's QuickCheck. Jepsen makes a
number of unusual choices specific to testing concurrent distributed systems:
notably, we don't do much shrinking (real-world systems are staggeringly
nondeterministic). Jepsen also includes tooling for automated deploys, fault
injection, a language for specifying complex concurrent schedules,
visualizations, storage, and an array of sophisticated property checkers.

~~~
sandGorgon
Is Jepsen for testing - say the microservices for Uber?

Or is it specific to the people who build things like databases, api
frameworks,etc.

~~~
aphyr
You can test pretty much any kind of concurrent system using Jepsen: in-memory
data structures, filesystems, databases, queues, APIs, services, etc. Not all
the tooling is applicable to every situation, but it's pretty darn general.

~~~
theptip
Do you know of anyone using Jepsen to torture their microservices? This sounds
like a really interesting usecase.

~~~
aphyr
I hear occasional rumors, but nobody's published.

------
camgunz
Reading through the source of Elle:

> "I cannot begin to convey the confluence of despair and laughter which I
> encountered over the course of three hours attempting to debug this issue.
> We assert that all keys have the same type, and that at most one integer
> type exists. If you put a mix of, say, Ints and Longs into this checker, you
> WILL question your fundamental beliefs about computers" [1].

I feel like Jepsen/Elle is a great argument for Clojure, reading the source is
actually kind of fun. Not what you'd expect for a project like this.

[1]: [https://github.com/jepsen-
io/elle/blob/master/src/elle/txn.c...](https://github.com/jepsen-
io/elle/blob/master/src/elle/txn.clj#L43)

~~~
agambrahma
Wonder if this "manual type constraints"-style code is pre-"spec"

~~~
aphyr
Normally I'm a core.typed person, but static type constraints don't quite make
sense here. We _want_ heterogeneity in some cases (e.g. you want to be able to
mix nils and ints), but not in others (e.g. this short and int mixing, which
_could_ be intentional, but also, might not be)

I've considered spec as well, but spec has a weird insistence that a keyword
has exactly one meaning in a given namespace, which is emphatically _not_ the
case in pretty much any code I've tried to verify. Also its errors are... not
exactly helpful.

~~~
lemming
This is interesting to me as a Clojure person - you would be approximately the
first person I've seen using core.typed since CircleCI's post in 2015
discussing why it didn't work for them. Are you using more modern versions of
core.typed? What's the experience like these days?

~~~
aphyr
I don't use it often. In general, I've found the number of bugs I catch with
core.typed doesn't justify the time investment in convincing things to
typecheck--my tests generally (not always, of course!) find type issues first.
I also tend to do a lot of weird performance-oriented stateful stuff with java
interop, which brings me into untyped corners of the library.

That said, I've found core.typed helpful in managing complex state
transformations, especially in namespaces which have, say, five or six similar
representations of the same logical thing. What do you do when a "node" is a
hostname, a logical identifier in Jepsen, an identifier in the database
itself, a UID, and a UID+signature pair? Managing those names can be tricky,
and having a type system really helps.

~~~
lemming
Interesting, thanks for the feedback. I often miss having a type system in
Clojure and I was really hoping that core.typed might be more ergonomic these
days. It sounds like it's still a pretty serious tradeoff, and probably not
worth it for most code.

------
micimize
This is my understanding of what a G2-Item Anti-dependecy Cycle is from the
linked paper example:

    
    
      -- Given (roughly) the following transactions:  
    
      -- Transaction 1 (SELECT, T1)
      with all_employees as (
        select sum(salary) as salaries
        from employees
      ),
      department as (
        select department, sum(salary) as salaries
        from employees group by department
      )
      select sum(all_employees.salaries) - sum(department.salaries);
    
      -- Transaction 2 (INSERT, T2)
      insert into employees (name, department, salary)
      values ('Tim', 'Sales', 70000);
    
      -- G2-Item is where the INSERT completes between all_employees and department,
      -- making the SELECT result inconsistent 
    
    

This is called an "anti-dependency" issue because T2 clobbers the data T1
depends on before it completes.

They say Elle found 6 such cases in 2 min, which I'm guessing is a "very big
number" of transactions, but can't figure out exactly how big that number is
based on the included logs/results.

Also, "Elle has found unexpected anomalies in every database we've checked"

~~~
aphyr
Yeah, it was relatively infrequent in that particular workload--dramatically
less than PostgreSQL's "repeatable read" exhibited. These histories are
roughly 15K successful transactions long--see the stats field in results.edn.
I'm hesitant to make strong statements about frequency, because I suspect this
kind of thing depends strongly on workload, but I would hazard a gueesssss
that it's not super common.

------
mekoka
Props to Jensen for exposing this longtime bug. Props to the PG team for
identifying the culprit and their response. This report just strengthens my
faith in the project.

------
redwood
It would be great to see Jepsen testing on distributed Postgres as this is a
single node issue they've found here. In prod don't folks run HA?

~~~
aphyr
I started this analysis intending to do just that--it's been difficult,
however, to figure out which of the dozens of replication/HA configurations to
actually test. I settled on Stolon, since it seemed to make the strongest
safety claims. However, I found bugs which turned out to be PostgreSQL's
fault, so I backed off to investigate those first.

~~~
qeternity
And herein lies the rub: HA Postgres is an extremely painful proposition.
Based on our non-scientific research, Patroni seems to be the most battle
tested solution, and as popular if not more so than Stolon.

~~~
aphyr
Patroni's documentation also seems to suggest that even with the strongest
settings, it can lose transactions; Stolon makes stronger claims.

~~~
feike
Patroni does have synchronous_mode_strict setting, which may be what you're
looking for:

This parameter prevents Patroni from switching off the synchronous replication
on the primary when no synchronous standby candidates are available. As a
downside, the primary is not be available for writes (unless the Postgres
transaction explicitly turns of synchronous_mode), blocking all client write
requests until at least one synchronous replica comes up.

[https://patroni.readthedocs.io/en/latest/replication_modes.h...](https://patroni.readthedocs.io/en/latest/replication_modes.html#replication-
modes)

edit: seems I missed this discussion on twitter:
[https://twitter.com/jepsen_io/status/1265626035380346881](https://twitter.com/jepsen_io/status/1265626035380346881)

~~~
aphyr
Er, again, the docs say "it is still possible to lose transactions even when
using synchronous_mode_strict". I've talked about this with some of the Stolon
folks on Twitter, and we're not exactly sure how that manifests--possibly an
SI or SSI violation.

~~~
merb
why is stolon more safe than? I tought stolon uses synchronous replication of
pg aswell?

------
KingOfCoders
We laughed when this happend to MongoDB.

The difference though is the reaction from the vendor.

~~~
ldng
For me, MongoDB has track record of bolstering a lot ("webscale") and
hiding/denying mistakes.

PostgreSQL is quite the opposite on that front, confident yet open to critics
and abble to admit mistakes. Hell, I've even them present their mistakes at
conferences and ask for help.

~~~
blablabla123
Yes, for instance not returning errors in some cases when writes fail. I think
this was until version 2 but to be fair they fixed this kind of stuff and
started to deal with this differently later on. However their reputation never
fully recovered from this.

------
pkilgore
> Neither process crashes, multiple tables, nor secondary-key access is
> required to reproduce our findings in this report. The technical
> justification for including them in this workload is “for funsies”.

Always read the footnotes!

------
reitanqild
By the way: where does the Jepsen name come from?

I have wondered more than once and my browsing and searching skills are
failing me on this one.

Edit: The closest link I can find is "Call me maybe" but I am not able to find
a causation or even a direct link or mention for now.

~~~
perlgeek
I don't actually know, but I could imagine it's a tribute to Carly Rae Jepsen
and their song "Call me maybe".

I dimly recall that either Aphyr's blog or the jepsen blog was called "call me
maybe" in the earlier days.

~~~
twunde
Here's at least one reference to it:
[https://www.informationweek.com/database/the-man-who-
torture...](https://www.informationweek.com/database/the-man-who-tortures-
databases/d/d-id/1111407) And it looks like earlier versions of the github
project looked more like this:
[https://github.com/threadwaste/jepsen](https://github.com/threadwaste/jepsen)
with references to the Carly Rae Jepson song in the project description in
both the README and in github.

Actually, it looks like the original talk (Slides:
[https://aphyr.com/media/jepsen-ricon-
east.pdf](https://aphyr.com/media/jepsen-ricon-east.pdf) has multiple
references) and the original blog post has a slug referring to the song
[https://aphyr.com/posts/281-call-me-maybe-carly-rae-
jepsen-a...](https://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-
the-perils-of-network-partitions)

------
threeseed
I am still wondering when we will see PostgreSQL being tested in a HA form.

It's just extraordinary to me that it's 2020 and it still does not have a
built-in, supported set of features for supporting this use case. Instead we
have to rely on proprietary vendor solutions or dig through the many obsolete
or unsupported options.

~~~
castorp
There is a built-in supported set of features for high availability. What
exactly are you missing?

~~~
phaemon
The option to install postgres on three instances, specify that they're in
cluster "foo" and then it just works, including automatically fixing any
issues when one of the instances drops out and rejoins.

That's what other DBs have but it seems to be missing from postgres. If it now
exists could you point me to the doc explaining how to do this?

------
popotamonga
What does this really mean? I just migrated from mongo to Pg.

~~~
snuxoll
There were edge cases in PostgreSQL’s SERIALIZABLE isolation level - which is
supposed to ensure that concurrent transactions behave as if they were
committed sequentially.

Specifically - if running a transaction as SERIALIZABLE there was a very small
chance that you might not see a rows inserted by another transaction that
committed before you in the order. Many applications don’t need this level of
transaction isolation - but for those that do it’s somewhat scary to know this
was lurking under the bed.

Every implementation of a “bank” system where you keep track of deposits and
withdrawals is a use-case for SERIALIZABLE, and this means a double-spend
could happen because the next transaction didn’t see an account just had a
transaction that drained the balance, for example.

Props to Jepsen for finding this.

~~~
detaro
The common bank example as I understand it doesn't require serializable, but
only snapshot isolation: If two transactions both drain the source balance,
the one that commits last will fail, because its snapshot doesn't match the
state anymore.

~~~
snuxoll
If you’re UPDATEing a balance on some account table - yes. If you’re using a
ledger and calculating balances (which you SHOULD) then SERIALIZABLE is
needed.

------
rolls-reus
So this does not affect SSI guarantees if the transactions involved all
operate on the same row? Is my understanding correct? For instance can I
update a counter with serializable isolation and not run into this bug?

~~~
aphyr
I think so, yeah. You _could_ theoretically have a G2-item anomaly on a single
key, but in PostgreSQL's case, the usual write-set conflict checking seems to
prevent them.

------
zeroimpl
For the repeatable read issue, I don't intuitively understand why the
violation mentioned would be a problem. In particular, even though the
transaction sequence listed wouldn't make sense for a serializable level, it
seems consistent with what I'd expect from repeatable read (though I have not
read the ANSI SQL standard's definition of repeatable read).

Any insights into why we should want repeatable read to block that? It feels
like blocking that is specifically the purpose of serializable isolation.

~~~
aphyr
The report talks at length about this, but maybe it bears rephrasing!

The ANSI definitions are bad: they allow multiple interpretations with varying
results. 25 years ago, Berenson, O'Neil, et al. published a paper showing the
ANSI definitions had this ambiguity, and that what the spec _meant_ to define
should have been a broader class of anomalies. They literally say that the
broad interpretation is the "correct" one, and the research community
basically went "oh, yeah, you're right". Adya followed up with generalized
isolation level definitions, and pretty much every paper I've read has gone
with these versions since. That didn't make its way back into the SQL spec
though: it's still ambiguous, which means you can interpret RR as allowing
G2-item.

Why prevent G2-item in RR? Because then the difference between repeatable read
and serializable is specifically _phantoms_ , rather than phantoms plus...
some other hard-to-describe anomalies. If you use the broad/generalized
interpretation, you can trust that a program which only accesses data by
primary key, running under repeatable read, is actually serializable. That's a
powerful, intuitive constraint. If you use the strict interpretation, RR
allows Other Weird Behaviors, and it's harder to prove an execution is
correct.

For a very thorough discussion of this, see either Berenson or Adya's papers,
linked throughout the report.

~~~
zeroimpl
Thanks, I got the part about the spec being ambiguous, am more interested in
the "why" aspect, since the current behaviour seems intuitive to the name
"repeatable read". But on closer inspection, I see PostgreSQL's repeatable
read blocks phantom reads even though the ANSI spec permits that! I don't get
why phantom reads would be acceptable under "repeatable read"... I probably
should give those papers a read some time. But in the meantime, given the
choice of phantom reads or G2-item, I think I'd pick blocking phantom reads.
(It might be nice to have the option to choose though!)

In PostgreSQL's case, if they somehow made repeatable read to prevent G2-item
without sacrificing the phantom reads, would that mean repeatable read is then
"serializable" according to the ANSI definition?

~~~
aphyr
_But in the meantime, given the choice of phantom reads or G2-item, I think I
'd pick blocking phantom reads._

Well... it's not quite so straightforward. SI still allows some phantoms. It
only prohibits some of them.

 _In PostgreSQL 's case, if they somehow made repeatable read to prevent
G2-item without sacrificing the phantom reads, would that mean repeatable read
is then "serializable" according to the ANSI definition?_

I'm not quite sure I follow--If you're asking whether snapshot isolation
(Postgres "Repeatable Read") plus preventing G2-item is serializable, the
answer is no--that model would still allow G2 in general--specifically, cycles
involving non-adjacent rw dependencies with predicates.

------
senderista
I don’t know why the author is surprised that Postgres offers stronger
guarantees than serializability in practice. Serializability per se allows
anomalies that would be disastrous in practice:
[http://dbmsmusings.blogspot.com/2019/06/correctness-
anomalie...](http://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-
under.html).

~~~
aphyr
It's not particularly surprising, but it is noteworthy. Other systems I've
tested have offered serializability but _not_ strict serializability: for
instance, CockroachDB.

------
emilyst
Ah, now I know why you hopped on IRC finally last week. :)

------
emmelaich
@aphyr could you please clarify this sentence?

> _This behavior is allowable due to long-discussed ambiguities in the ANSI
> SQL standard, but could be surprising for users familiar with the
> literature._

Should that be "not familiar"? And which literature - the standard or the
discussions?

~~~
aphyr
Familiar. If you've read any of the literature on snapshot isolation or
isolation levels in general (Berenson, Bernstein, Gray, Melton, O'Neil,
O'Neil, Adya, Liskov, Fekete, Cerone, Bailis, Alvisi, Clement, Crooks, Pu...)
you would probably not expect to see SI treated as stronger than RR. The paper
which first defined SI says it's not. Heck, Jim Melton, who _wrote_ that part
of the ANSI spec, is a co-author on that paper, which goes on to say the
strict interpretation is incorrect and not what ANSI intended!

~~~
emmelaich
Thank you.

BTW, nice to see one of of my lecturers (Fekete) among the names!

------
wlll
Thanks for doing these, they're incredibly interesting, useful, amusing (Oh
no! the schadenfreude!) and also, incredibly inspiring to me to be a better
engineer, so thank you again :)

------
julienfr112
Can (should ?) Jepsen tests be integrated in PostGres CI/CT ? Can we raise
money for that ?

------
ordx
Any plans to test any other NoSQL databases? I'm interested in MarkLogic

~~~
aphyr
No, Jepsen is over, sorry!

Jokes aside, Marklogic is welcome to pay me. Each one of these reports takes
weeks to months of full-time work.

~~~
arthurcolle
This is probably outside the purview of Jepsen but have you ever looked at
Kdb, or the language it is build on top of? (K)

Curious to hear your thought on it! Would love a Jepsen style analysis of kdb

------
k0k-xox
I think its funny

