
Jepsen: TiDB 2.1.7 - aphyr
https://jepsen.io/analyses/tidb-2.1.7
======
dongxu
PingCAP CTO here. I really enjoyed cooperating with Kyle over these two months,
and I want to talk a little bit about my feelings. Before the "official" Jepsen
test, TiDB had already been writing its own Jepsen test cases for a long time.
Our past Jepsen test code became a good starting point for this test. Even so,
test cases written by the database's own developers will inevitably have some
blind spots. This is why, even if your own Jepsen tests pass, I still recommend
doing the official test with Kyle's team.

In the process of testing, we have no way to influence what Kyle tests.
Usually, Kyle only tells us when abnormal behaviour occurs. Our main job
is to analyze the cause and then fix it. Here are some tips for friends who
want to do a Jepsen test in the future:

1\. Carefully check your documentation before testing to make sure it is
consistent with your database implementation.

2\. It is best for someone on the team to understand Jepsen, build your own
Jepsen test environment internally, and test in parallel with Kyle's team.

3\. A simplified deployment process is very helpful.

4\. Be sure which version of your DB is under test. Usually, you can select one
version when starting the test and another one before the end.

I really enjoyed working with Kyle. I would say Kyle has a good sense of humor
:), and I hope to have another chance to cooperate with Kyle and his team in
the future! Thank you.

------
yingw787
@aphyr I love reading these discussions about Jepsen! How would an individual
contributor get involved with Jepsen and contribute during free time? Is most
of the work related to adding public Jepsen tests for each database or is
there work available with the framework as well? I'd love to learn more about
Jepsen and distributed systems testing!

~~~
aphyr
Hi there! There's work to do on both, and I encourage you to contribute to
core if you like! We could really use performance improvements on the checkers
--Knossos and set-full are great, but much more expensive than they need to
be. It'd be nice to have a robust mechanism for adjusting CLOCK_MONOTONIC &
friends. I don't have a good way to inject filesystem level faults, like
simulated node crashes where non-fsynced data is lost. jepsen.control is
plagued by what I think are race conditions in Jsch, which I've never had the
chance to dig in and fix.

Writing tests is also great! Pick a system you like and dive in. There's a
full tutorial
([https://github.com/jepsen-io/jepsen/tree/master/doc/tutorial](https://github.com/jepsen-io/jepsen/tree/master/doc/tutorial))
that walks you through designing and running a test from scratch.

~~~
macdice
Database hacker here. I have dreamed of a filesystem fault injection system
like this: a fuse filesystem that just passes through to a real filesystem,
but also writes a log of syscalls. Then a test mode that can simulate crashes
at various points in the log history, with pages (of some size) flushed at
random (or other settings) until fsync is issued and completes. This could be
used for recovery testing: looking for missing fsyncs and checking torn-page
resilience.

~~~
nh2
I'm currently working on a "scriptable strace" in Haskell that you can use to
inject arbitrary faults via syscall modification.

[https://github.com/nh2/hatrace](https://github.com/nh2/hatrace)

I've used it so far to find and deterministically reproduce situations where
the Haskell compiler and build tools wrote files in non-atomic/non-durable
ways that would lead to failures when the machine was hard-rebooted at the
wrong time.

An example test is
[https://github.com/nh2/hatrace/blob/7300dbf2c/test/HatraceSp...](https://github.com/nh2/hatrace/blob/7300dbf2c/test/HatraceSpec.hs#L182-L229),
and the compiler test is just below.

~~~
macdice
Nice! Yeah, syscall interception may be a better way to do this than fuse.

So, the thing I want is the ability to take a log (which probably also has
checkpoints in it, I dunno) captured by normal operation of a database, and
then run a super long slow test that puts the (fake) filesystem into a state
as of various points along that log where some not-yet-fsync'd changes are
lost, on a sector-by-sector basis (most systems rely on at least 512-byte sector
writes being atomic, so you'd probably do it at that size). Then you'd test if
the database can recover successfully and pass some sanity test at all those
points along the log.

People do this kind of testing by pulling the power on busy databases. This
would simulate pulling the power at a huge number of different times, on a
maximally unforgiving filesystem/hardware (= 512-byte sector atomicity, flushed in
random order).
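
A rough Python sketch of that idea, for illustration only. The log format, the `crash_image` name, and the 50% survival chance are all assumptions made up for this example, and writes are assumed to be sector-aligned with lengths that are multiples of 512 bytes. It builds one possible on-disk image for a crash at a chosen point in the log: everything before the last completed fsync is durable, and each later sector write independently survives or is lost.

```python
import random

SECTOR = 512  # assumed atomic write unit, as discussed above


def crash_image(log, crash_at, rng):
    """Return one possible disk image if power is lost after log[:crash_at].

    Log entries are ("write", offset, data) or ("fsync",). This format is
    hypothetical; a real tool would capture it from syscalls.
    """
    prefix = log[:crash_at]
    # Index of the last fsync that completed before the crash (-1 if none).
    last_fsync = max(
        (i for i, op in enumerate(prefix) if op[0] == "fsync"), default=-1
    )
    disk = {}  # sector number -> bytes currently on "disk"
    for i, op in enumerate(prefix):
        if op[0] != "write":
            continue
        _, offset, data = op
        for k in range(0, len(data), SECTOR):
            sector = (offset + k) // SECTOR
            # Writes before the last fsync are durable; later ones are a coin flip.
            if i < last_fsync or rng.random() < 0.5:
                disk[sector] = data[k:k + SECTOR]
    return disk
```

A test driver would loop over every `crash_at` (and many random seeds), materialize each image, start the database on it, and run its recovery and sanity checks.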

------
orthecreedence
This is great. I've been evaluating CockroachDB for a high-write data
situation, and TiDB came up as an alternative. CRDB having a Jepsen test and
TiDB not having one tilted the decision pretty heavily toward CRDB. Looks like
I can really give both a look now.

------
dgreensp
Most Jepsen reports seem to find glaring flaws, including subsystems that add
complexity while reducing correctness, like the “auto-retry” features here, or
novel consensus schemes that just aren’t airtight and therefore technically
don’t work.

~~~
kbenson
I think Bryan Cantrill put it well in one of his more recent talks. Jepsen
reports used to be a lot more fun because projects made grand claims first,
and then Jepsen came to actually test them. Now projects run Jepsen ahead of
time to qualify their claims so there are fewer big GOTCHA! moments.

So, less fun but better quality products. A few more years and maybe Jepsen
won't find many flaws at all as they'll be identified in development using the
same tools.

~~~
aphyr
I agree, but I should also note that many of the recent analyses I've
published were on systems which wrote their own tests and missed significant
errors. Like, "database loses updates frequently in normal operation"
significant!

I think a bunch of factors are in play here. Sometimes vendors discover bugs
with Jepsen, then adjust the test so that it passes, rather than fixing the
underlying bug. Consul, IIRC, found errors with Jepsen, and adjusted a timeout
parameter to prevent the test from observing concurrent primary nodes, which
meant the test passed. The underlying problem was still there--but no longer
visible. Sometimes you'll see people turn a workload that does a read-modify-
write transaction into a single-statement update transaction, in such a way
that the database can execute it atomically. That's still a legal and sensible
test, but may not measure the same type of transactional isolation.

Other times, vendors write a custom checker, but haven't tested to make sure
that it can catch the anomaly that it's supposed to. Maybe a checker is copy-
pasted from another test I wrote for a different system which used different
names or data structures to represent histories, so it executes successfully,
but in a trivial way--maybe it fails to realize any elements were added to a
set, so it assumes every read is legal, since all 0 elements are present!
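
To make that failure mode concrete, here is a small Python sketch (not Jepsen's actual checker, which is written in Clojure). The history representation, a list of dicts with `type`, `f`, and `value` keys, and the name `check_set` are assumptions for this example. The point is the guard: if the checker finds no acknowledged adds or no successful reads, it reports "unknown" rather than passing vacuously.

```python
def check_set(history):
    """Check a set workload: every acknowledged add must be in the final read.

    history is a list of dicts with assumed keys "type" ("ok", "fail", "info"),
    "f" ("add" or "read"), and "value".
    """
    added = {op["value"] for op in history
             if op["f"] == "add" and op["type"] == "ok"}
    reads = [op["value"] for op in history
             if op["f"] == "read" and op["type"] == "ok"]
    if not added or not reads:
        # Nothing to verify: refuse to call this a pass.
        return {"valid": "unknown", "lost": set()}
    lost = added - set(reads[-1])
    return {"valid": not lost, "lost": lost}
```

Feeding it a history you know is broken (an acknowledged add missing from the final read) should give `valid: False`. Feeding it a history that uses a different operation name, say `"insert"` instead of `"add"`, should give `"unknown"` instead of a green result.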

Other times tests are correctly written, but only run for a short time--maybe
the nemesis schedule they designed doesn't actually cause leader elections to
take place, so you never observe that phase change in the system. Or the test
only performs a handful of operations and sits idle until the end of the test.
Or every request fails, and the test passes trivially, since we never observed
an illegal result. Some of these mistakes I can address with better analyzers
and prompting users with guided error messages. Others you have to
qualitatively infer by looking at the graphs and histories. That's something I
work on training people to do in my classes, but I haven't written good guides
for it online yet.
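
One cheap automated guard against the "every request fails" case is to summarize the history before trusting a pass. The sketch below is an assumption-laden illustration, not part of Jepsen: the dict-based history, the `summarize` name, and the 10% threshold are all made up here. It counts completions by type and flags runs where too few operations succeeded.

```python
from collections import Counter


def summarize(history, min_ok_fraction=0.1):
    """Count completed operations by type and flag histories too thin to trust.

    Each op is a dict with an assumed "type" key: "invoke", "ok", "fail", "info".
    """
    counts = Counter(op["type"] for op in history if op["type"] != "invoke")
    total = sum(counts.values())
    ok_fraction = counts["ok"] / total if total else 0.0
    return {
        "counts": dict(counts),
        "ok_fraction": ok_fraction,
        # Too few successes means a "valid" verdict tells us very little.
        "suspicious": ok_fraction < min_ok_fraction,
    }
```

This catches only the crudest problems. Idle periods or a nemesis that never triggers an election still need a look at the graphs.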

And of course, there's a familiarity problem: with Jepsen, I've been working
on and with this tool for six years, so I'm intimately familiar with its
behavior and testing philosophy. So much of that knowledge is implicit and
intuitive for me, and users who are adopting that tool fresh don't have
the benefit of that experience! It's something you can build with time, and I
can help transfer with writing and teaching. Working on it! :)

------
atombender
This is a pretty disappointing result.

I've noted TiDB's weak transaction isolation guarantees before, which were not
well documented. In particular, they claimed phantom reads were not possible,
which this test shows isn't true.

That said, it looks like these issues aren't fundamental design problems and
can be fixed.

~~~
aphyr
I don't think these particular phantoms are likely to be fixed--they're legal
under SI, and that's all TiDB is intended to provide, at least right now. They
might do serializability later though!

------
nullwasamistake
This is great news. The only blocker is lack of FK constraints. I know this is
a performance issue, but it's a blocker for those expecting it to behave like
MySQL. We have a lot of bad queries (unfortunately) that rely on constraints
for correctness. It's too dangerous to do any migration that eliminates them
without at least a reduced performance setting to warn about violations that
we can run for a month or two before committing 100%.

It pains me to say this, but strong ties to China company-wide will likely
mean my org never adopts this, fk constraints aside. A victim of the trade
war? Yes. But many of our clients are strictly "no Chinese vendors for
libraries or dependencies", even if the data is on our own servers.

I'm not sure if anything can be done at this point, but building an isolated
US based org would help immensely with adoption. The cap table also being
mostly Chinese investors is enough to stop most US companies at the DD phase.

------
ryanworl
@aphyr:

"We have not evaluated filesystem or disk faults with TiDB, and cannot speak
to crash recovery. Nor have we tested dynamic membership changes. Both might
be fruitful avenues of investigation for future research."

How do you choose what components to test once you get past obvious stuff like
the highest claimed isolation level?

~~~
aphyr
I give suggestions to my clients, and ask them what they care about testing
most, and we... pretty much always agree on what direction the research goes
next. FS/disk faults I don't have good tooling for yet, so I'd have to invest
some time into developing those tools--makes sense to defer that work.
Membership changes are infrequent and surprisingly difficult to write
automation for, so I think it makes more sense to focus on the static-cluster
case first.

~~~
jinqueeny
Regarding FS/Disk faults, here is some information about what TiDB does:

tl;dr: Chaos Engineering

Some links:
[https://cdn2.hubspot.net/hubfs/4466002/Using%20Chaos%20Engin...](https://cdn2.hubspot.net/hubfs/4466002/Using%20Chaos%20Engineering%20to%20Build%20a%20Reliable%20TiDB.pdf)

[https://www.pingcap.com/blog/chaos-practice-in-tidb/](https://www.pingcap.com/blog/chaos-practice-in-tidb/)

Regarding the Fault Injection tools we are using:

\- Kernel Fault Injection, the fault injection framework included in the Linux
kernel, which you can use to implement simple fault injections to test device
drivers.

\- SystemTap, a scripting language and tool for diagnosing performance or
functional problems.

\- Fail: gofail for Go and fail-rs for Rust.

\- Namazu, a programmable fuzzy scheduler for testing distributed systems.

We also built our own automated chaos platform, Schrodinger, to automate all
these tests and improve both efficiency and coverage.

------
jinqueeny
See the companion blog from PingCAP:
[https://pingcap.com/blog/tidb-passes-jepsen-test-for-snapsho...](https://pingcap.com/blog/tidb-passes-jepsen-test-for-snapshot-isolation-and-single-key-linearizability/)

------
PeterCorless
As always, deeply insightful and merciless. :)

~~~
segmondy
I don't see what's merciless about it. DB systems make various claims, and he
just goes about checking whether those claims are true. I think a serious
takeaway from these is that all DB systems have issues. Some faults do happen,
so know them and design for them. Any DB that claims to have no faults is
lying.

------
gorn
Hi @aphyr. I'm a great fan of your work with Jepsen although I know very
little about the fault tolerance of distributed systems. Are there any
resources you would recommend on the subject? I am an application developer so
I don't see myself writing a database in the future. Still it would be great
to learn about the concepts.

Cheers!

~~~
aphyr
I've written an outline that might be useful!
[https://github.com/aphyr/distsys-class](https://github.com/aphyr/distsys-class)

~~~
yehosef
This is fantastic - thanks!

------
shenli3514
Thanks for your excellent work!

