During the testing process, we have no way to influence what Kyle tests. Usually, only Kyle tells us when abnormal behaviour occurs; our main job is to analyze the cause and then fix it.
Here are some tips for friends who want to do a Jepsen test in the future:
1. Carefully check your documentation before testing to make sure it is consistent with your database implementation.
2. It is best for someone on the team to understand Jepsen and build your own test environment internally, so you can run tests in parallel with Kyle's team.
3. A simplified deployment process is very helpful.
4. Be clear about the version of your DB before testing. Usually, you can select one version when starting the test and another one before the end.
I really enjoyed working with Kyle. I would say Kyle has a good sense of humor :), and I hope to have another chance to cooperate with Kyle and his team in the future! Thank you.
So, less fun but better quality products. A few more years and maybe Jepsen won't find many flaws at all, as they'll be identified in development using the same tools.
I think a bunch of factors are in play here. Sometimes vendors discover bugs with Jepsen, then adjust the test so that it passes, rather than fixing the underlying bug. Consul, IIRC, found errors with Jepsen, and adjusted a timeout parameter to prevent the test from observing concurrent primary nodes, which meant the test passed. The underlying problem was still there--but no longer visible. Sometimes you'll see people turn a workload that does a read-modify-write transaction into a single-statement update transaction, in such a way that the database can execute it atomically. That's still a legal and sensible test, but may not measure the same type of transactional isolation.
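To make that last distinction concrete, here's a minimal sketch in Go (the accounts table, column names, and placeholder syntax are hypothetical; assume a database/sql driver is already registered). The first workload does an explicit read-modify-write inside a transaction, which is what exercises transactional isolation; the second folds the increment into a single UPDATE that many databases can execute atomically on their own.

```go
package workload

import (
	"context"
	"database/sql"
)

// readModifyWrite reads a balance and writes it back in one transaction.
// Whether concurrent increments are lost depends on the database's
// transactional isolation, which is what the test is trying to measure.
func readModifyWrite(ctx context.Context, db *sql.DB, id int) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	var balance int
	if err := tx.QueryRowContext(ctx,
		"SELECT balance FROM accounts WHERE id = ?", id).Scan(&balance); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		"UPDATE accounts SET balance = ? WHERE id = ?", balance+1, id); err != nil {
		return err
	}
	return tx.Commit()
}

// singleStatement performs the same logical increment as one statement. The
// database can often execute this atomically by itself, so a passing test may
// say little about multi-statement isolation.
func singleStatement(ctx context.Context, db *sql.DB, id int) error {
	_, err := db.ExecContext(ctx,
		"UPDATE accounts SET balance = balance + 1 WHERE id = ?", id)
	return err
}
```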
Other times, vendors write a custom checker, but haven't tested to make sure that it can catch the anomaly that it's supposed to. Maybe a checker is copy-pasted from another test I wrote for a different system which used different names or data structures to represent histories, so it executes successfully, but in a trivial way--maybe it fails to realize any elements were added to a set, so it assumes every read is legal, since all 0 elements are present!
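As an illustration of that failure mode, here's a hypothetical sketch in Go (the history format and operation names are invented for this example, not Jepsen's): the checker was written against histories that tag adds as ":append", so against a history that tags them as "add" it records zero expected elements, and every read passes vacuously.

```go
package checker

// Op is a hypothetical history entry: an element added to a set.
type Op struct {
	F     string // operation name, e.g. "add" or "read"
	Value int    // element added (for adds)
}

// Check verifies that every acknowledged add shows up in the final read. It
// is buggy in exactly the way described above: it looks for ":append" ops,
// but this history records adds as "add", so the expected set stays empty and
// the loop over it never runs. All zero expected elements are "present", and
// the checker passes no matter what the database did.
func Check(history []Op, finalRead []int) bool {
	expected := map[int]bool{}
	for _, op := range history {
		if op.F == ":append" { // bug: should be "add" for this history
			expected[op.Value] = true
		}
	}
	present := map[int]bool{}
	for _, v := range finalRead {
		present[v] = true
	}
	for v := range expected { // never executes: expected is empty
		if !present[v] {
			return false
		}
	}
	return true
}
```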
Other times tests are correctly written, but only run for a short time--maybe the nemesis schedule they designed doesn't actually cause leader elections to take place, so you never observe that phase change in the system. Or the test only performs a handful of operations and sits idle until the end of the test. Or every request fails, and the test passes trivially, since we never observed an illegal result. Some of these mistakes I can address with better analyzers and prompting users with guided error messages. Others you have to qualitatively infer by looking at the graphs and histories. That's something I work on training people to do in my classes, but I haven't written good guides for it online yet.
And of course, there's a familiarity problem: I've been working on and with Jepsen for six years, so I'm intimately familiar with its behavior and testing philosophy. So much of that knowledge is implicit and intuitive for me, and users who are adopting the tool fresh don't have the benefit of that experience! It's something you can build with time, and that I can help transfer with writing and teaching. Working on it! :)
Writing tests is also great! Pick a system you like and dive in. There's a full tutorial (https://github.com/jepsen-io/jepsen/tree/master/doc/tutorial) that walks you through designing and running a test from scratch.
The old school method used to be a system with a remote controlled power switch. Run your test suite and regularly trip the power.
But automated crash-testing systems, coming out of research, exist today:
I've used it so far to find and deterministically reproduce situations where the Haskell compiler and build tools wrote files in non-atomic/non-durable ways that would lead to failures when the machine was hard-rebooted at the wrong time.
An example test is https://github.com/nh2/hatrace/blob/7300dbf2c/test/HatraceSp..., and the compiler test is just below.
So, the thing I want is the ability to take a log (which probably also has checkpoints in it, I dunno) captured during normal operation of a database, and then run a super long, slow test that puts the (fake) filesystem into the state it would have at various points along that log if some not-yet-fsync'd changes were lost, on a sector-by-sector basis (most systems rely on at least 512-byte sectors being atomic, so you'd probably do it at that size). Then you'd test whether the database can recover successfully and pass some sanity check at all those points along the log.
People do this kind of testing by pulling the power on busy databases. This would simulate pulling the power at a huge number of points in time, on a maximally unforgiving filesystem/hardware (= 512-byte sector atomicity, flushed in random order).
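Here's a minimal sketch in Go of that simulation, under assumed types rather than any existing tool's API: the captured log records 512-byte sector writes and fsync barriers, and each generated crash image keeps everything written before the last fsync plus an arbitrary subset of the unsynced sectors, since an unforgiving disk may flush those in any order. You'd then materialize each image into a scratch directory, run the database's recovery on it, and apply a sanity check.

```go
package crashsim

import "math/rand"

const SectorSize = 512 // sectors of this size are assumed to be written atomically

// Entry is one record in the captured write log: either a sector write or an
// fsync barrier.
type Entry struct {
	Fsync  bool
	Sector int64            // sector index (byte offset / SectorSize)
	Data   [SectorSize]byte // sector contents
}

// CrashImage builds one possible post-crash disk state, simulating power loss
// just before log entry `at`. Sectors written before the last fsync barrier
// are always present; sectors written after it survive independently at
// random, modelling out-of-order flushing.
func CrashImage(log []Entry, at int, rng *rand.Rand) map[int64][SectorSize]byte {
	if at > len(log) {
		at = len(log)
	}
	lastFsync := 0
	for i := 0; i < at; i++ {
		if log[i].Fsync {
			lastFsync = i
		}
	}
	disk := map[int64][SectorSize]byte{}
	for i := 0; i < at; i++ {
		e := log[i]
		if e.Fsync {
			continue
		}
		if i < lastFsync || rng.Intn(2) == 0 {
			disk[e.Sector] = e.Data
		}
	}
	return disk
}
```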
Also, awesome work! Also, I really enjoyed the Goto Nights meetup you presented at in Chicago last summer.
When I have spare time (one can dream, right?) I'm gonna try porting this to panic-server.
I've noted TiDB's weak transaction isolation guarantees before, which were not well documented. In particular, they claimed phantom reads were not possible, which this test shows isn't true.
That said, it looks like these issues aren't fundamental design problems and can be fixed.
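For reference, a phantom read is the anomaly where a transaction repeats a predicate query and sees rows that a concurrent transaction inserted and committed in between. A minimal sketch of its shape in Go, with a hypothetical events table and whatever driver and isolation level you're testing:

```go
package anomalies

import (
	"context"
	"database/sql"
)

// phantomRead runs the same predicate query twice in one transaction, with a
// concurrent insert committed in between. If the counts differ, the
// transaction observed a phantom.
func phantomRead(ctx context.Context, db *sql.DB) (sawPhantom bool, err error) {
	t1, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
	if err != nil {
		return false, err
	}
	defer t1.Rollback()

	var before int
	if err := t1.QueryRowContext(ctx,
		"SELECT COUNT(*) FROM events WHERE kind = 'x'").Scan(&before); err != nil {
		return false, err
	}

	// A concurrent session inserts a matching row and commits.
	if _, err := db.ExecContext(ctx,
		"INSERT INTO events (kind) VALUES ('x')"); err != nil {
		return false, err
	}

	var after int
	if err := t1.QueryRowContext(ctx,
		"SELECT COUNT(*) FROM events WHERE kind = 'x'").Scan(&after); err != nil {
		return false, err
	}
	return after != before, t1.Commit()
}
```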
It pains me to say this, but strong ties to China company-wide will likely mean my org never adopts this, fk constraints aside. A victim of the trade war? Yes. But many of our clients are strictly "no Chinese vendors for libraries or dependencies", even if the data is on our own servers.
I'm not sure if anything can be done at this point, but building an isolated US-based org would help immensely with adoption. The cap table also being mostly Chinese investors is enough to stop most US companies at the DD phase.
"We have not evaluated filesystem or disk faults with TiDB, and cannot speak to crash recovery. Nor have we tested dynamic membership changes. Both might be fruitful avenues of investigation for future research."
How do you choose what components to test once you get past obvious stuff like the highest claimed isolation level?
tl;dr: Chaos Engineering
Regarding the Fault Injection tools we are using:
- Kernel Fault Injection: the fault injection framework included in the Linux kernel, which you can use to implement simple fault injections to test device drivers.
- SystemTap: a scripting language and tool for diagnosing performance or functional problems.
- Fail: gofail for Go and fail-rs for Rust, failpoint libraries for injecting faults into application code (a minimal sketch of the failpoint idea follows this list).
- Namazu: a programmable fuzzy scheduler for testing distributed systems.
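As a rough illustration, here's a minimal hand-rolled sketch in Go of the failpoint idea behind gofail and fail-rs (not their actual APIs): code under test declares named points where a fault may be injected, and the test harness arms them by name with a firing probability.

```go
package failpoint

import (
	"math/rand"
	"sync"
)

var (
	mu     sync.RWMutex
	points = map[string]float64{} // failpoint name -> probability of firing
)

// Enable arms a failpoint so that Triggered(name) fires with probability p.
func Enable(name string, p float64) {
	mu.Lock()
	defer mu.Unlock()
	points[name] = p
}

// Disable removes a failpoint.
func Disable(name string) {
	mu.Lock()
	defer mu.Unlock()
	delete(points, name)
}

// Triggered reports whether the named failpoint should fire right now. Code
// under test calls it wherever a fault can be injected, e.g.
//
//	if failpoint.Triggered("raft/drop-append") {
//		return errInjected // pretend the message was lost
//	}
func Triggered(name string) bool {
	mu.RLock()
	p, ok := points[name]
	mu.RUnlock()
	return ok && rand.Float64() < p
}
```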
We also built our own automatic chaos platform, Schrodinger, to automate all these tests and improve both efficiency and coverage.