PingCAP CTO here. I really enjoyed working with Kyle over these two months, and I'd like to share a few thoughts.
Long before this "official" Jepsen test, we had written our own Jepsen test cases for TiDB, and that code was a good starting point for this analysis. That said, tests written by the database's own developers will inevitably have blind spots, so even if your own Jepsen tests pass, I still recommend an official test by Kyle's team.
During the testing process, we had no way to influence what Kyle tested. Usually Kyle would simply tell us when abnormal behaviour occurred; our main job was to analyze the cause and then fix it.
Here are some tips for friends who want to do a Jepsen test in the future:
1. Carefully check your documentation before testing to make sure it is consistent with your database implementation.
2. It is best for someone on your team to understand Jepsen, build your own test environment internally, and test in parallel with Kyle's team.
3. A simplified deployment process is very helpful.
4. Pin down the version of your DB before testing. Usually you pick one version when the test starts and another before it ends.
I really enjoyed working with Kyle. I would say Kyle has a good sense of humor :), and I hope to have another chance to work with Kyle and his team in the future! Thank you.
@aphyr I love reading these discussions about Jepsen! How would an individual contributor get involved with Jepsen and contribute during free time? Is most of the work related to adding public Jepsen tests for each database or is there work available with the framework as well? I'd love to learn more about Jepsen and distributed systems testing!
Hi there! There's work to do on both, and I encourage you to contribute to core if you like! We could really use performance improvements on the checkers--Knossos and set-full are great, but much more expensive than they need to be. It'd be nice to have a robust mechanism for adjusting CLOCK_MONOTONIC & friends. I don't have a good way to inject filesystem level faults, like simulated node crashes where non-fsynced data is lost. jepsen.control is plagued by what I think are race conditions in Jsch, which I've never had the chance to dig in and fix.
Database hacker here. I have dreamed of a filesystem fault injection system like this: a FUSE filesystem that just passes through to a real filesystem, but also writes a log of syscalls. Then a test mode that can simulate crashes at various points in the log history, with pages (of some size) flushed at random (or under other settings) until fsync is issued and completes. This could be used for recovery testing: looking for missing fsyncs and checking torn-page resilience.
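To make that concrete, here's a rough sketch (in Go) of what the logging half could look like -- a plain wrapper around file writes standing in for the FUSE layer. Everything here is hypothetical, not an existing library:

```go
package shadowfs

import "os"

// A rough sketch of the logging half of the idea: a passthrough wrapper that
// records every write and fsync before forwarding it to the real file. A real
// implementation would sit behind FUSE or intercept syscalls; the interesting
// part is the shape of the log. All names here are hypothetical.

type OpKind int

const (
	OpWrite OpKind = iota // data written at an offset
	OpFsync               // fsync completed for this file
)

type Op struct {
	Kind   OpKind
	Path   string
	Offset int64  // only meaningful for OpWrite
	Data   []byte // only meaningful for OpWrite
}

// LoggedFile passes operations through to the underlying file while appending
// each one to a shared operation log.
type LoggedFile struct {
	f   *os.File
	log *[]Op
}

func (lf *LoggedFile) WriteAt(p []byte, off int64) (int, error) {
	data := append([]byte(nil), p...) // copy, since callers may reuse p
	*lf.log = append(*lf.log, Op{Kind: OpWrite, Path: lf.f.Name(), Offset: off, Data: data})
	return lf.f.WriteAt(p, off)
}

func (lf *LoggedFile) Sync() error {
	*lf.log = append(*lf.log, Op{Kind: OpFsync, Path: lf.f.Name()})
	return lf.f.Sync()
}
```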
I've used it so far to find and deterministically reproduce situations where the Haskell compiler and build tools wrote files in non-atomic/non-durable ways that would lead to failures when the machine was hard-rebooted at the wrong time.
Nice! Yeah, syscall interception may be a better way to do this than fuse.
So, the thing I want is the ability to take a log (which probably also has checkpoints in it, I dunno) captured by normal operation of a database, and then run a super long, slow test that puts the (fake) filesystem into the state it would have at various points along that log, where some not-yet-fsync'd changes are lost on a sector-by-sector basis (most systems rely on at least 512-byte sectors being atomic, so you'd probably do it at that size). Then you'd test whether the database can recover successfully and pass some sanity test at all those points along the log.
People do this kind of testing by pulling the power on busy databases. This would simulate pulling the power at a huge number of points in time, on a maximally unforgiving filesystem/hardware (= 512-byte sector atomicity, flushed in random order).
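Here's a rough sketch of what the replay half might look like, building on the hypothetical logging types sketched earlier: replay the log up to an arbitrary crash point, keep everything that was fsynced, and let each not-yet-fsynced 512-byte sector survive independently at random:

```go
package shadowfs

import "math/rand"

const sectorSize = 512

// imageAtCrash replays the first crashAt operations of a logged run and returns
// the on-disk bytes a maximally unforgiving device might contain at that point:
// fsynced data is always present, while each unsynced 512-byte sector survives
// independently at random. Run database recovery against the result and check
// invariants. (Builds on the hypothetical Op types sketched above.)
func imageAtCrash(log []Op, crashAt int, rng *rand.Rand) map[string][]byte {
	durable := map[string][]byte{} // contents known to be on disk (fsynced)
	cached := map[string][]byte{}  // contents written but not yet fsynced

	grow := func(buf []byte, n int64) []byte {
		if int64(len(buf)) < n {
			buf = append(buf, make([]byte, n-int64(len(buf)))...)
		}
		return buf
	}

	for i := 0; i < crashAt && i < len(log); i++ {
		op := log[i]
		switch op.Kind {
		case OpWrite:
			buf := grow(cached[op.Path], op.Offset+int64(len(op.Data)))
			copy(buf[op.Offset:], op.Data)
			cached[op.Path] = buf
		case OpFsync:
			// fsync makes this file's cached contents durable.
			durable[op.Path] = append([]byte(nil), cached[op.Path]...)
		}
	}

	// At the crash point, each unsynced sector may or may not have reached the disk.
	for path, buf := range cached {
		img := grow(append([]byte(nil), durable[path]...), int64(len(buf)))
		for off := 0; off < len(buf); off += sectorSize {
			if rng.Intn(2) == 0 { // this sector happened to be flushed
				end := off + sectorSize
				if end > len(buf) {
					end = len(buf)
				}
				copy(img[off:end], buf[off:end])
			}
		}
		durable[path] = img
	}
	return durable
}
```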
Thank you! Clojure is a natural fit for Jepsen: it has access to broad library support (key for testing lots of different databases), makes data manipulation easy (simplifies writing checkers), and has a good concurrency story (important for a concurrent testing system). It's got reasonable performance and a very expressive language core.
This is great. I've been evaluating CockroachDB for a high-write workload, and TiDB came up as an alternative. CRDB having a Jepsen test and TiDB not having one tipped the decision pretty heavily toward CRDB. Looks like I can really give both a look now.
Most Jepsen reports seem to find glaring flaws, including subsystems that add complexity while reducing correctness, like the “auto-retry” features here, or novel consensus schemes that just aren’t airtight and therefore technically don’t work.
I think Bryan Cantrill put it well in one of his more recent talks: Jepsen reports used to be a lot more fun because projects made grand claims first, and then Jepsen came along to actually test them. Now projects run Jepsen ahead of time to qualify their claims, so there are fewer big GOTCHA! moments.
So, less fun, but better-quality products. A few more years and maybe Jepsen won't find many flaws at all, as they'll be identified during development using the same tools.
I agree, but I should also note that many of the recent analyses I've published were on systems which wrote their own tests and missed significant errors. Like, "database loses updates frequently in normal operation" significant!
I think a bunch of factors are in play here. Sometimes vendors discover bugs with Jepsen, then adjust the test so that it passes, rather than fixing the underlying bug. Consul, IIRC, found errors with Jepsen, and adjusted a timeout parameter to prevent the test from observing concurrent primary nodes, which meant the test passed. The underlying problem was still there--but no longer visible. Sometimes you'll see people turn a workload that does a read-modify-write transaction into a single-statement update transaction, in such a way that the database can execute it atomically. That's still a legal and sensible test, but may not measure the same type of transactional isolation.
Other times, vendors write a custom checker, but haven't tested to make sure that it can catch the anomaly that it's supposed to. Maybe a checker is copy-pasted from another test I wrote for a different system which used different names or data structures to represent histories, so it executes successfully, but in a trivial way--maybe it fails to realize any elements were added to a set, so it assumes every read is legal, since all 0 elements are present!
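One cheap safeguard against that failure mode -- just a sketch, not anything Jepsen ships -- is to unit-test the checker itself against a tiny hand-written history you know is bad, and make sure it actually complains:

```go
package checkers

import "testing"

// Toy stand-in for a set-style checker: every acknowledged add should appear in
// a successful read. The names and shapes here are illustrative only.

type HistoryOp struct {
	F     string // "add" or "read"
	Value int    // element added (for "add")
	Read  []int  // elements observed (for "read")
	OK    bool   // whether the operation succeeded
}

// Lost returns acknowledged elements that never showed up in any successful read.
func Lost(history []HistoryOp) []int {
	added, seen := map[int]bool{}, map[int]bool{}
	for _, op := range history {
		if !op.OK {
			continue
		}
		switch op.F {
		case "add":
			added[op.Value] = true
		case "read":
			for _, v := range op.Read {
				seen[v] = true
			}
		}
	}
	var lost []int
	for v := range added {
		if !seen[v] {
			lost = append(lost, v)
		}
	}
	return lost
}

// The safeguard: a known-bad history MUST be flagged. If this test fails, the
// checker is passing trivially and its "valid" verdicts mean nothing.
func TestCheckerCatchesLostElement(t *testing.T) {
	bad := []HistoryOp{
		{F: "add", Value: 1, OK: true},
		{F: "add", Value: 2, OK: true},
		{F: "read", Read: []int{1}, OK: true}, // 2 was acknowledged but never read
	}
	if len(Lost(bad)) == 0 {
		t.Fatal("checker failed to flag a lost element")
	}
}
```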
Other times tests are correctly written, but only run for a short time--maybe the nemesis schedule they designed doesn't actually cause leader elections to take place, so you never observe that phase change in the system. Or the test only performs a handful of operations and sits idle until the end of the test. Or every request fails, and the test passes trivially, since we never observed an illegal result. Some of these mistakes I can address with better analyzers and prompting users with guided error messages. Others you have to qualitatively infer by looking at the graphs and histories. That's something I work on training people to do in my classes, but I haven't written good guides for it online yet.
And of course, there's a familiarity problem: with Jepsen, I've been working on and with this tool for six years, so I'm intimately familiar with its behavior and testing philosophy. So much of that knowledge is implicit and intuitive for me, and users who are adopting the tool fresh don't have the benefit of that experience! It's something you can build with time, and something I can help transfer with writing and teaching. Working on it! :)
I've noted TiDB's weak transaction isolation guarantees before, which were not well documented. In particular, they claimed phantom reads were not possible, which this test shows isn't true.
That said, it looks like these issues aren't fundamental design problems and can be fixed.
I don't think these particular phantoms are likely to be fixed--they're legal under SI, and that's all TiDB is intended to provide, at least right now. They might do serializability later though!
This is great news. The only blocker is lack of FK constraints. I know this is a performance issue, but it's a blocker for those expecting it to behave like MySQL. We have a lot of bad queries (unfortunately) that rely on constraints for correctness. It's too dangerous to do any migration that eliminates them without at least a reduced performance setting to warn about violations that we can run for a month or two before committing 100%.
It pains me to say this, but strong ties to China company-wide will likely mean my org never adopts this, fk constraints aside. A victim of the trade war? Yes. But many of our clients are strictly "no Chinese vendors for libraries or dependencies", even if the data is on our own servers.
I'm not sure if anything can be done at this point, but building an isolated US based org would help immensely with adoption. The cap table also being mostly Chinese investors is enough to stop most US companies at the DD phase
"We have not evaluated filesystem or disk faults with TiDB, and cannot speak to crash recovery. Nor have we tested dynamic membership changes. Both might be fruitful avenues of investigation for future research."
How do you choose what components to test once you get past obvious stuff like the highest claimed isolation level?
I give suggestions to my clients, and ask them what they care about testing most, and we... pretty much always agree on what direction the research goes next. FS/disk faults I don't have good tooling for yet, so I'd have to invest some time into developing those tools--makes sense to defer that work. Membership changes are infrequent and surprisingly difficult to write automation for, so I think it makes more sense to focus on the static-cluster case first.
- Kernel Fault Injection: the fault injection framework included in the Linux kernel, which you can use to implement simple fault injection for testing device drivers.
- SystemTap: a scripting language and tool for diagnosing performance or functional problems.
- Fail: gofail for Go and fail-rs for Rust (a simplified sketch of this pattern follows below).
- Namazu: a programmable fuzzy scheduler for testing distributed systems.
We also built our own automated chaos platform, Schrodinger, to run all of these tests automatically and improve both efficiency and coverage.
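For readers who haven't seen fail points before, here is a minimal hand-rolled sketch of the pattern that gofail and fail-rs provide (this is not either library's actual API):

```go
package failpoint

import (
	"errors"
	"sync"
)

// A hand-rolled sketch of the fail-point pattern behind tools like gofail and
// fail-rs (NOT their actual API). Production code calls Eval at named points;
// tests arm those points to inject errors or panics on demand.

var (
	mu      sync.RWMutex
	actions = map[string]string{} // point name -> "error" or "panic"
)

// Enable arms a named fail point with an action.
func Enable(name, action string) {
	mu.Lock()
	defer mu.Unlock()
	actions[name] = action
}

// Disable disarms a named fail point.
func Disable(name string) {
	mu.Lock()
	defer mu.Unlock()
	delete(actions, name)
}

// Eval is called from production code; it is a no-op unless a test armed it.
func Eval(name string) error {
	mu.RLock()
	action, armed := actions[name]
	mu.RUnlock()
	if !armed {
		return nil
	}
	if action == "panic" {
		panic("failpoint " + name)
	}
	return errors.New("failpoint " + name + ": injected error")
}
```

Production code then sprinkles calls like failpoint.Eval("raft-before-apply") at interesting points (the name is made up for illustration), and a chaos test arms them to exercise error handling and crash-recovery paths.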
I don't see what's merciless about it. DB systems make various claims; he just goes about verifying whether those claims are true. I think a serious takeaway from these reports is that all DB systems have issues and some faults do happen: know them and design for them. Any DB that claims to have no faults is lying.
Hi @aphyr. I'm a great fan of your work with Jepsen, although I know very little about the fault tolerance of distributed systems. Are there any resources you would recommend on the subject? I am an application developer, so I don't see myself writing a database in the future. Still, it would be great to learn about the concepts.