They mention briefly how they are going about randomized crash testing:
> The random series of operations also includes a “restart” operation. When a “restart” operation is encountered, any data that has been written to the OS but not “synced” is discarded. Achieving this discard behavior was relatively straightforward because all filesystem operations in Pebble are performed through a filesystem interface. We merely had to add a new implementation of this interface which buffered unsynced data and discarded this buffered data when a “restart” occurred.
but this seems to only scratch the surface of possibilities that can come up with a crash. For example, it's possible the filesystem had synced some of the buffered data to disk, but not all of it. There's no guarantee about what buffered data was synced to disk. All you know is that some, all, or none of it made it to disk.
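For illustration, here's a minimal sketch of such a buffering file shim, extended to model exactly that uncertainty (hypothetical types, not Pebble's actual vfs code; a real filesystem can also persist blocks out of order, so even a random prefix is a simplification):

```go
package main

import "math/rand"

// memFile is a toy stand-in for a file behind a crash-testing
// filesystem shim: writes accumulate in unsynced until Sync promotes
// them to synced (i.e. durable).
type memFile struct {
	synced   []byte
	unsynced []byte
}

func (f *memFile) Write(p []byte) {
	f.unsynced = append(f.unsynced, p...)
}

func (f *memFile) Sync() {
	f.synced = append(f.synced, f.unsynced...)
	f.unsynced = nil
}

// crash models the some/all/none outcome: the OS may have flushed an
// arbitrary amount of the unsynced data before the crash. The post's
// "restart" operation corresponds to always keeping zero bytes.
func (f *memFile) crash(rng *rand.Rand) {
	keep := rng.Intn(len(f.unsynced) + 1)
	f.synced = append(f.synced, f.unsynced[:keep]...)
	f.unsynced = nil
}

func main() {
	rng := rand.New(rand.NewSource(1))
	f := &memFile{}
	f.Write([]byte("synced data"))
	f.Sync()
	f.Write([]byte("unsynced tail"))
	f.crash(rng) // f.synced now ends with a random prefix of the tail
}
```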
Bugs in this area are still regularly found in e.g. Postgres, so I'm having a hard time seeing how Cockroach is making sure Pebble doesn't have similar problems.
We're only worried about functionality in Pebble used by CockroachDB. RocksDB has a huge number of features that sometimes have bugs due to subtle interactions. There is a very stable subset of RocksDB: the configuration and specific API usage patterns used internally by Facebook. That precise combination has seen extreme testing. But that isn't the subset of RocksDB used by CockroachDB. I would guess that the most significant testing of the subset of RocksDB used by CockroachDB is the testing we do at Cockroach Labs. Now that testing is being directed at Pebble along with the Pebble-specific testing detailed in the post.
> For example, it's possible the filesystem had synced some of the buffered data to disk, but not all of it. There's no guarantee about what buffered data was synced to disk. All you know is that some, all, or none of it made it to disk.
The filesystem does provide guarantees when you use fsync() and fdatasync(). Postgres relies on these guarantees. So does RocksDB. Pebble's usage of fsync/fdatasync mirrors RocksDB's. Our crash testing is not testing the filesystem guarantees, only that we're correctly using fsync/fdatasync (which is hard enough to get right).
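For anyone who hasn't written this kind of code, the pattern under test looks roughly like the following sketch (simplified; not Pebble's or RocksDB's actual code):

```go
package main

import (
	"os"
	"path/filepath"
)

// appendDurably shows the standard fsync dance for a WAL append.
// Getting this ordering right (including syncing the parent directory
// for newly created files) is the part that's hard to get right.
func appendDurably(path string, record []byte) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write(record); err != nil {
		return err
	}
	// Flush file data to stable storage (fsync/fdatasync under the hood).
	// A failed sync must be treated as fatal for this data: the page
	// cache state afterwards is undefined, so retrying is unsound.
	if err := f.Sync(); err != nil {
		return err
	}
	// A newly created file also needs its directory entry made durable,
	// or the file itself can be missing after a crash.
	d, err := os.Open(filepath.Dir(path))
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

func main() {
	if err := appendDurably("wal.log", []byte("record\n")); err != nil {
		panic(err)
	}
}
```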
For anyone unfamiliar, fsync/fdatasync are infamous for all sorts of subtle sharp edges: https://www.usenix.org/conference/atc20/presentation/rebello
Having synchronous replication via paxos/raft can mitigate a lot of this.
From the Rebello paper:
> However, on restart, since the log entry is in the page cache, LevelDB includes it while creating an SSTable from the log file.
Pebble and RocksDB both inherited this behavior. The nuance here is that the sstable is then synced to disk, and no reads are served until the sync has succeeded. If the machine were to crash before the sstable was synced, upon restart we'd roll back to the durable prefix of the log.
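To spell out that ordering (a simplified sketch with stand-in types, not Pebble's recovery code):

```go
package main

type record []byte

// sstable stands in for an on-disk table built during log replay.
type sstable struct{ data []record }

func buildSSTable(mem []record) *sstable { return &sstable{data: mem} }

// sync stands in for fsyncing the new sstable to disk.
func (s *sstable) sync() error { return nil }

// recoverFromLog replays the log (which, per the paper, may include a
// tail that only ever lived in the page cache), flushes it to an
// sstable, and syncs that sstable before any reads are served. If the
// machine crashes before the sync completes, the next restart replays
// only the durable prefix of the log, so unacknowledged writes never
// become visible.
func recoverFromLog(log []record) (*sstable, error) {
	sst := buildSSTable(log)
	if err := sst.sync(); err != nil {
		return nil, err
	}
	return sst, nil // only now is the store opened for reads
}

func main() {
	_, _ = recoverFromLog([]record{record("durable"), record("unsynced tail")})
}
```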
If RocksDB had had no bugs, they wouldn't have needed to write Pebble.
In addition to randomized crash tests, we have a suite of end-to-end integration tests on top of Cockroach, called Roachtests, that put clusters under a combination of node crash/restart scenarios and confirm data consistency.
Reminds me of fsync gate: https://news.ycombinator.com/item?id=20491965 and https://news.ycombinator.com/item?id=19119991 (not implying Pebble uses fsync incorrectly).
One of the cool things about the LevelDB codebase is the `Repository contents` section: a brief description of the most relevant modules so people can get familiar with the code base quicker. As someone very interested in storage engines, I would love to see something similar here. Are you guys planning on adding some extra documentation to the project?
There is ongoing work on a similar OS called RebbleOS.
Hopefully it will be portable to other low-end smartwatches.
Thinking of switching to Apple watch because I have iPhone and I also want to use Apple Pay without the phone. If they had better battery life, I probably would have done it already.
Googling shows people have reported the error on restart several times, on mailing lists and elsewhere. But what help is it to report it to MariaDB or the like? Do FB notice? Seemingly not.
Here’s hoping someone at FB browses HN...
I don’t get why FB don’t have some fuzzing and chaos-monkey stress tests to find easy stability bugs :(
1) Schema Changes by DDL (e.g. ALTER TABLE, CREATE INDEX)
2) Recovering primary instances without failover
We use our own open-source tool OnlineSchemaChange to do schema changes (details: https://github.com/facebook/mysql-5.6/wiki/Schema-Changes), which is heavily optimized for MyRocks use cases, e.g. utilizing bulk loading for both primary and secondary keys. ALTER TABLE / CREATE INDEX support in MyRocks is limited and suboptimal -- it does not support Online/Instant DDL (so writes to the same table are blocked during ALTER), and it takes a non-bulk-loading path that tries to load the entire table in one transaction, which may hit the row lock count limit or run out of memory. We have plans to improve the regular DDL paths in MyRocks in MySQL 8.0, including support for atomic, online, and instant schema changes.
I am also realizing that a lot of external MySQL users still don't have auto-failover and try to recover primary instances when they go down. This means single-instance availability and recoverability are much more important for them. We set rocksdb_wal_recovery_mode=1 (kAbsoluteConsistency) by default in MyRocks, which actually degraded recoverability (higher chances of refusing to start even when recovery from the binlog was possible). We're changing the default to 2 (kPointInTimeRecovery) so that it is more robust without relying on replicas for recovery.
It would have been a really bad experience to hit OOM because of 1) and then fail to restart because of 2). We have relationships with MariaDB and Percona, and will make the default behavior better for users.
The problem is that the program runs out of RAM. The challenge is to write the data and metadata in such a way that a program crashing at any point for any reason is recoverable.
This is the basic promise of the Durability in ACID, and people using MyRocks expect it.
Rather than actually making sure that MyRocks is durable, they simply slap on a 'max transaction rows' limit to make it unlikely you run out of RAM. Instead, you simply get an error and can't do things like ALTER TABLE or UPDATE on large tables.
Of course it's easy to run out of RAM despite these thresholds, and it's easy to find advice when you google the error messages that leads you to raise the thresholds and even set a 'bulk load' flag that disables various checks you probably haven't investigated.
The whole approach is wrongheaded!
A database that crashes should not be corrupt!!! Isn't this reliability 101? Why doesn't MyRocks have chaos-monkey stress testing, etc.?
Because Facebook has little incentive to ensure that RocksDB works well in your use case. MyRocks was built for Facebook and anything that Facebook doesn’t do probably isn’t particularly hardened. They aren’t going to invest time doing chaos monkey stress testing on codepaths they don’t use. Things like durability might not be super important to them because they will make it up in redundancy.
I remember being burned by something similar during the early days of Cassandra. I’m sure Cockroach has hit the same bugs.
What were the trade-offs which made it necessary to create something new instead of adapting what exists?
> A final alternative would be to use another storage engine, such as Badger or BoltDB (if we wanted to stick with Go). This alternative was not seriously considered for several reasons. These storage engines do not provide all the features we require, so we would have needed to make significant enhancements to them.
> Lastly, various RocksDB-isms have slipped into the CockroachDB code base, such as the use of the sstable format for sending snapshots of data between nodes. Removing these RocksDB-isms, or providing adapters, would either be a large engineering effort, or impose unacceptable performance overhead.
All the issues have been ported over to Discourse. And no one has declared Badger bug-free. I don’t know where you got that idea.
> Basic operations: Set, Get, Merge, Delete, Single Delete, Range Delete
It makes no mention of RocksDB's MultiGet/MultiRead -- is CockroachDB/Pebble limited to query-at-a-time per thread? I'm genuinely curious how this all translates into Go's M:N coroutine model currently and moving forward with Pebble.
RocksDB MultiGet is interesting. Parallelism is achieved by using new IO interfaces (io_uring), not by using threads. That approach seems right to me. See https://github.com/facebook/rocksdb/wiki/MultiGet-Performanc.... My understanding is that io_uring support is still a work in progress. We experimented at one point with using goroutines in Pebble to parallelize lookups, but doing so was strictly worse for performance. Experimenting with io_uring is something we'd like to do.
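For context, the goroutine experiment looks roughly like this toy sketch (hypothetical get function, not the Pebble API); for lookups that hit the block cache, the per-goroutine scheduling overhead outweighed the parallelism:

```go
package main

import "sync"

// get is a hypothetical stand-in for a single-key lookup.
func get(key string) string { return "value-for-" + key }

// multiGet fans one lookup per key out to goroutines. For CPU-bound,
// cached reads, the goroutine scheduling cost makes this slower than a
// plain loop; io_uring-style parallelism avoids that by keeping the
// work on one goroutine and parallelizing only the disk reads.
func multiGet(keys []string) []string {
	results := make([]string, len(keys))
	var wg sync.WaitGroup
	for i, k := range keys {
		wg.Add(1)
		go func(i int, k string) {
			defer wg.Done()
			results[i] = get(k)
		}(i, k)
	}
	wg.Wait()
	return results
}

func main() {
	_ = multiGet([]string{"a", "b", "c"})
}
```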
> Does it multiplex a single query by dispatching smaller requests to a thread-pool?
Yes, though it depends on the query. Trivial queries (i.e. single-row lookups) are executed serially as that is the fastest way to execute them. Complex queries are decomposed along data boundaries and the query parts are executed in parallel next to where the data is located.
I'd be interested to hear whether the garbage collector has presented any problems when running in production.
PS PebblesDB was a research project and is dead as far as I know.
Maybe a less aggressive rollout strategy?
- what is the plan to tackle Go's GC?
It seems to me that above a certain scale people run into GC problems with Go.
- have they considered WickedDB?
It appears to be a good candidate for their needs.
Has Go evolved better low-GC features? As I understand Go's GC vs the JVM's, Go avoids major GC pauses by simply pushing collection into the future and consuming memory more readily.
But a database is a long-running program, so you have to pay the piper eventually.
> This percentage can be configured by the GOGC environment variable or by calling debug.SetGCPercent. The default value is 100, which means that GC is triggered when the freshly allocated data is equal to the amount of live data at the end of the last collection period. This generally works well in practice, but the Pebble Block Cache is often configured to be 10s of gigabytes in size. Waiting for 10s of gigabytes of data to be allocated before triggering a GC results in very large Go heap sizes.
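For reference, the knob the quote describes can be set either via the GOGC environment variable or at runtime; a trivial sketch (the value 50 is just an example):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to running with GOGC=50: trigger a collection once
	// freshly allocated data reaches 50% of the live heap, trading more
	// frequent GC for a smaller peak heap. With a block cache tens of
	// gigabytes in size counted as live data, the default of 100 lets
	// the heap grow by that much again before the next collection.
	old := debug.SetGCPercent(50)
	fmt.Printf("previous GOGC equivalent: %d\n", old)
}
```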
I would think the worst thing to program in a GC'd language would be the cache. So, if you write that layer of the database separately in a non-GC'd language, you would avoid most of the headache.
EDIT: I read the parent comment as asking why program a DB in a GC'd language. I think the new wave of 1st-gen DBs are doing a lot of novel things. Minus the cache, I imagine the velocity improvements from the 'simpler' language make sense. Key-value stores have become well-trodden ground, however.
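One way to get part of that benefit without leaving Go is to keep the cache pointer-free, so the collector has almost nothing to scan. A hypothetical sketch (not what Pebble actually does): store values in one big byte arena indexed by pointer-free offsets, so the GC sees a single slice header instead of millions of small objects.

```go
package main

// span locates a value inside the arena using offsets, not pointers.
type span struct{ off, len uint32 }

type arenaCache struct {
	arena []byte          // one big allocation; the GC sees one object
	index map[uint64]span // pointer-free keys and values: cheap to scan
}

func newArenaCache(size int) *arenaCache {
	return &arenaCache{arena: make([]byte, 0, size), index: make(map[uint64]span)}
}

func (c *arenaCache) set(key uint64, val []byte) {
	off := uint32(len(c.arena))
	c.arena = append(c.arena, val...)
	c.index[key] = span{off: off, len: uint32(len(val))}
}

func (c *arenaCache) get(key uint64) ([]byte, bool) {
	s, ok := c.index[key]
	if !ok {
		return nil, false
	}
	return c.arena[s.off : s.off+s.len], true
}

func main() {
	c := newArenaCache(1 << 20)
	c.set(42, []byte("hello"))
	if v, ok := c.get(42); ok {
		_ = v
	}
}
```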
If you are leaking, not so much. :)
The equivalent of FoundationDB is present inside of CockroachDB: a distributed, replicated, transactional, KV layer. This is where a big chunk of CockroachDB value resides. This is where our use of Raft resides. Pebble lies underneath this.
RocksDB is shipping soon thanks to some work by members of the community.
Why does the implementation language matter for a non-library tool? Is that its only selling point?
Is it just that Rust is really good at developer relations? Because it seems to me that all new foundational technology is safer and faster in a language like Rust, and things written in Go should be higher up the food chain.
If you were a Rust developer you'd want something similar written in Rust.
> CockroachDB is primarily a Go code base, and the Cockroach Labs engineers have developed broad expertise in Go.
The question, unstated, unopinionated, and intellectually honest, was: does language impact community adoption, and if so, what are the drivers behind it?
I think their omission of major features such as transactions is more likely to limit adoption than the language choice, so language choice is kind of irrelevant from that perspective.
In this case, given that they already have a database written in Go, writing the backing KV store library in Go has a clear performance benefit: it avoids roughly 171ns of cgo overhead on every operation. This is above and beyond familiarity and easy integration, although those are also important.
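That per-operation figure is the cost of crossing the cgo boundary. A crude way to see it yourself (a sketch, not a rigorous benchmark; assumes a toolchain with cgo enabled):

```go
package main

/*
static void nop() {}
*/
import "C"

import (
	"fmt"
	"time"
)

func main() {
	const n = 1_000_000
	start := time.Now()
	for i := 0; i < n; i++ {
		C.nop() // each call pays the Go<->C crossing cost
	}
	fmt.Printf("cgo call overhead: ~%v/op\n", time.Since(start)/n)
}
```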