
Pebble: A RocksDB Inspired Key-Value Store Written in Go - dilloc
https://www.cockroachlabs.com/blog/pebble-rocksdb-kv-store/
======
malisper
So I understand the rationale for writing your own storage layer and think
this is an awesome project, but there's something missing for me. One of the
issues Peter brings up is they've come across a number of serious bugs in
RocksDB. My question is: why would Pebble have fewer bugs? In fact, I would
expect it to have significantly more, because Cockroach is the only company
using Pebble.

They mention briefly how they are going about randomized crash testing:

> The random series of operations also includes a “restart” operation. When a
> “restart” operation is encountered, any data that has been written to the OS
> but not “synced” is discarded. Achieving this discard behavior was
> relatively straightforward because all filesystem operations in Pebble are
> performed through a filesystem interface. We merely had to add a new
> implementation of this interface which buffered unsynced data and discarded
> this buffered data when a “restart” occurred.

but this seems to only scratch the surface of possibilities that can come up
with a crash. For example, it's possible the filesystem had synced some of the
buffered data to disk, but not all of it. There's no guarantee about what
buffered data was synced to disk. All you know is that some, all, or none of
it made it to disk.

Bugs in this area are still regularly found in e.g. Postgres, so I'm having a
hard time seeing how Cockroach is making sure Pebble doesn't have similar
problems.

~~~
d--b
Well, I think what they're saying is that they'd rather have bugs in code
they've written than in code that is written by other people and in another
language, and for which they don't control the patching pipeline.

If RocksDB had had no bugs, they wouldn't have needed to write Pebble.

~~~
pdpi
That's an argument for them using it, but it's also basically arguing why
nobody else should.

~~~
strken
Avoiding cgo would be a selling point for anyone else using Go. Presumably
other pure-Go KV stores like bbolt/badger/goleveldb would also solve that
problem, but I don't know enough about them to understand the trade-offs.

------
chaosharmonic
Was anyone else deeply saddened about three words into the headline, on
realizing this wasn't a watch? (RIP)

~~~
jsight
I definitely was. I still use one now, more than 3 years after their business
failure. I really wish someone would make something like the Pebble Time 2.

~~~
m-p-3
I have an Amazfit Bip, but the UI isn't as good as Pebble's, sadly.

There is some work in making a similar OS called RebbleOS[1] currently
ongoing.

[1]: [https://github.com/pebble-dev/RebbleOS](https://github.com/pebble-dev/RebbleOS)

Hopefully it will be portable to other low-end smartwatches.

~~~
Polylactic_acid
3 years later and it looks like they're still just trying to get hardware
features accessible. At this rate all Pebble devices will have died or been
discarded before it's able to show a notification on your wrist.

~~~
m-p-3
Hopefully the OS could run on something newer than a Pebble, like the
PineTime.

------
willvarfar
I’ve run into serious house-burning-down problems with MyRocks too. Simple
recipe to crash MySQL in a way that is unrecoverable: do ALTER TABLE on a big
table; it runs out of RAM, crashes, and refuses to restart, ever.

Googling around, people have reported the error on restart several times on
lists and such. What help is it to report to MariaDB or the like? And do FB
notice? Seems not.

Here’s hoping someone at FB browses HN...

I don’t get why FB don’t have some fuzzing and chaos monkey stress test to
find easy stability bugs :(

~~~
yoshinorim
I am one of the creators of MyRocks at FB. We have a few common MySQL
features/operations we don't use at FB. Notably:

1) Schema Changes by DDL (e.g. ALTER TABLE, CREATE INDEX)

2) Recovering primary instances without failover

We use our own open source tool OnlineSchemaChange to do schema changes
(details: [https://github.com/facebook/mysql-5.6/wiki/Schema-Changes](https://github.com/facebook/mysql-5.6/wiki/Schema-Changes)),
which is heavily optimized for MyRocks use cases, like utilizing bulk loading
for both primary and secondary keys. ALTER TABLE / CREATE INDEX support in
MyRocks is limited and suboptimal -- it does not support Online/Instant DDL
(so it blocks writes to the same table during ALTER), and it takes the
non-bulk-loading path, trying to load the entire table in one transaction --
which may hit the row lock count limit or run out of memory. We have plans to
improve the regular DDL paths in MyRocks in MySQL 8.0, including supporting
atomic, online and instant schema changes.

I am also realizing that a lot of external MySQL users still don't have auto
failover and try to recover primary instances if they go down. This means
single instance availability and recoverability is much more important for
them. We set rocksdb_wal_recovery_mode=1 (kAbsoluteConsistency) by default in
MyRocks, which actually degraded recoverability (higher chances to refuse to
start even if it can be recovered from binlog). We're changing defaults to 2
(kPointInTimeRecovery) so that it can be more robust without relying on
replicas for recovery.

It would have been a really bad experience to hit OOM because of 1) and then
fail to restart because of 2). We have relationships with MariaDB and
Percona, and will make the default behavior better for users.

~~~
willvarfar
Thanks for explaining this! Really appreciate that you joined in here.

We've been test-running our real-time DWH ETLs on MyRocks (and Postgres and
Timescale and even InnoDB) to compare with our previous workhorse, TokuDB.
We've chewed through CPU-years iterating over every switch and setting we can
think of, to find the optimum config for our workloads.

For example, we've found that MyRocks really slows down if you do a SELECT
... WHERE id IN (...) with too long a list of ids.

So we have lots of thoughts and data points on things my team have found easy,
hard, painful, better etc. I'd be happy to share with you folks.

(FWIW we are moving from TokuDB to MyRocks now, with tweaks to how we do data
retention and GDPR depersonalization and things)

Ping me on willvarfar at google's freemail domain if that's useful!

------
cube2222
How does this compare to Badger[0], another similar in nature key-value store
in Go?

What were the trade-offs which made it necessary to create something new
instead of adapting what exists?

[0]: [https://github.com/dgraph-io/badger](https://github.com/dgraph-io/badger)

~~~
ohnoesjmr
Badger is written by mad people, from my point of view: they disabled issues
on GitHub, from my understanding declared it "done" and "bug free", and any
issue tracking is now done on a forum where threads roll off into the void
with no further trace.

~~~
mrjn
Wow. You describe us as “mad people” because we chose not to use GitHub
issues? Is that all it takes to dismiss an open-source project and badmouth
its authors? You have gone really low here.

All the issues have been ported over to Discourse. And no one has declared
Badger bug-free. I don’t know where you got that idea.

------
jasonzemos
Concurrency and multithreading are a major focus of both Go and RocksDB. This
introduction makes little mention of those areas, and I'm curious if there's
any more to be said on this. The article lists several features being
reimplemented, including:

> Basic operations: Set, Get, Merge, Delete, Single Delete, Range Delete

It makes no mention of RocksDB's MultiGet/MultiRead -- is CockroachDB/Pebble
limited to query-at-a-time per thread? I'm genuinely curious how this all
translates into Go's M:N coroutine model currently and moving forward with
Pebble.

~~~
petermattis
Pebble does not currently implement MultiGet as CockroachDB did not use
RocksDB's MultiGet operation. CockroachDB can use multiple nodes to process a
query by decomposing SQL queries along data boundaries and shipping the query
parts to be executed next to the data. CockroachDB can't directly use MultiGet
because that API was not compatible with how CockroachDB reads keys.

RocksDB MultiGet is interesting. Parallelism is achieved by using new IO
interfaces (io_uring), not by using threads. That approach seems right to me.
See [https://github.com/facebook/rocksdb/wiki/MultiGet-Performance](https://github.com/facebook/rocksdb/wiki/MultiGet-Performance).
My understanding is that io_uring support is still a work in progress. We
experimented at one point with using goroutines in Pebble to parallelize
lookups, but doing so was strictly worse for performance. Experimenting with
io_uring is something we'd like to do.
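The goroutine experiment described here would, in its naive form, look roughly like the fan-out below. The `Reader` interface and `mapReader` are hypothetical stand-ins, not Pebble's API; the point is that for cheap in-memory lookups the per-goroutine scheduling cost tends to exceed the lookup itself, consistent with the "strictly worse" result, and only pays off when each Get blocks on real IO.

```go
package main

import (
	"fmt"
	"sync"
)

// Reader is a hypothetical point-lookup interface, not Pebble's API.
type Reader interface {
	Get(key string) (string, bool)
}

// mapReader is a toy in-memory implementation for illustration.
type mapReader map[string]string

func (m mapReader) Get(k string) (string, bool) { v, ok := m[k]; return v, ok }

// multiGet fans each lookup out to its own goroutine and collects
// results positionally; missing keys yield empty strings.
func multiGet(r Reader, keys []string) []string {
	out := make([]string, len(keys))
	var wg sync.WaitGroup
	for i, k := range keys {
		wg.Add(1)
		go func(i int, k string) {
			defer wg.Done()
			if v, ok := r.Get(k); ok {
				out[i] = v
			}
		}(i, k)
	}
	wg.Wait()
	return out
}

func main() {
	r := mapReader{"a": "1", "b": "2"}
	fmt.Println(multiGet(r, []string{"a", "b", "c"})) // prints [1 2 ]
}
```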

~~~
jasonzemos
Indeed the conceptual fork point mentioned is RocksDB 6.2.1 which came before
those features. The problem with RocksDB is that one thread only makes one
request at a time. I should've phrased my question more succinctly: Is
Pebble/CockroachDB capable of saturating the backplane with requests in
parallel? Does it multiplex a single query by dispatching smaller requests to
a thread-pool?

~~~
petermattis
> Is Pebble/CockroachDB capable of saturating the backplane with requests in
> parallel?

Yes.

> Does it multiplex a single query by dispatching smaller requests to a
> thread-pool?

Yes, though it depends on the query. Trivial queries (i.e. single-row lookups)
are executed serially as that is the fastest way to execute them. Complex
queries are decomposed along data boundaries and the query parts are executed
in parallel next to where the data is located.

------
djhworld
Really enjoyed reading this, thanks.

Would be interested to see if the garbage collector has presented any problems
when running in production

~~~
tyingq
There's some notes on that here:
[https://github.com/cockroachdb/pebble/blob/c39589c8cb36d95df...](https://github.com/cockroachdb/pebble/blob/c39589c8cb36d95df29b37e85ec2b4c3e20273dc/docs/memory.md)

~~~
petermattis
The TLDR is that the GC did cause problems so we had to avoid it for the block
cache. Luckily we were able to do so without exposing the complexity in the
API. Not for the faint of heart. Don't try this at home kids.
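One common way to keep a large cache mostly invisible to the GC is to hold the block bytes in a single big allocation and index into it by offset rather than keeping millions of separately allocated objects. The sketch below is a toy illustration of that general trick, not Pebble's actual implementation (which, per the linked doc, goes further and manages memory outside the Go heap).

```go
package main

import "fmt"

// span records where a cached block lives inside the arena.
type span struct{ off, length int }

// arenaCache stores all block bytes in one []byte, so the GC sees a
// single allocation instead of one object per cached block, keeping
// mark phases cheap even at tens of GB.
type arenaCache struct {
	arena []byte
	index map[string]span
}

func newArenaCache(capacity int) *arenaCache {
	return &arenaCache{
		arena: make([]byte, 0, capacity),
		index: make(map[string]span),
	}
}

// Set copies the block into the arena. Returns false when full;
// a real cache would evict and reuse space instead.
func (c *arenaCache) Set(key string, block []byte) bool {
	if len(c.arena)+len(block) > cap(c.arena) {
		return false
	}
	off := len(c.arena)
	c.arena = append(c.arena, block...)
	c.index[key] = span{off, len(block)}
	return true
}

// Get returns a view into the arena (valid until eviction/reuse).
func (c *arenaCache) Get(key string) ([]byte, bool) {
	s, ok := c.index[key]
	if !ok {
		return nil, false
	}
	return c.arena[s.off : s.off+s.length], true
}

func main() {
	c := newArenaCache(1 << 20)
	c.Set("sst1/block0", []byte("hello"))
	v, _ := c.Get("sst1/block0")
	fmt.Printf("%s\n", v) // prints hello
}
```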

~~~
throwdbaaway
Any reason why the block cache needs to be 10s of GB in size? Cassandra, for
example, usually has a rather small key cache on heap and then just relies on
the kernel page cache. I don't have experience with Cockroach, but it looks
like the block cache is similar to Cassandra's row cache, which can be
configured to be on heap or off heap but is usually not beneficial to enable.

------
mholt
Not to be confused with Let's Encrypt's ACME client testing CA server project
(the "scaled down" version of Boulder), Pebble:
[https://github.com/letsencrypt/pebble](https://github.com/letsencrypt/pebble)

------
fis
The name is a little close to this existing LevelDB fork, maybe consider a
different name?
[https://github.com/utsaslab/pebblesdb](https://github.com/utsaslab/pebblesdb)

~~~
petermattis
Damned for using a unique name (CockroachDB), and damned for using an
innocuous one.

PS PebblesDB was a research project and is dead as far as I know.

------
2020-09-15-tmp
Thought it would be worth mentioning Sled as an alternative to RocksDB for the
Rust crowd:

[https://github.com/spacejam/sled](https://github.com/spacejam/sled)

~~~
silasb
Worth mentioning that there is
[https://crates.io/crates/rocksdb](https://crates.io/crates/rocksdb), which
provides Rust bindings to RocksDB.

------
LaserToy
I hope you folks know what you are doing. If you screw it up you will have a
lot of angry former customers, us included.

Maybe a less aggressive rollout strategy?

------
StreamBright
A few questions come to mind reading this:

- what is the plan to tackle Go's GC?

It seems to me that above a certain scale people run into GC problems with
Go.[1]

- have they considered wickdb?[2]

It appears to be a good candidate for their needs.

[1]: [https://blog.discord.com/why-discord-is-switching-from-go-to-rust-a190bbca2b1f](https://blog.discord.com/why-discord-is-switching-from-go-to-rust-a190bbca2b1f)

[2]: [https://github.com/Fullstop000/wickdb](https://github.com/Fullstop000/wickdb)

------
erichocean
This makes total sense for Cockroach Labs, and I trust their engineering
ability to get it right.

------
AtlasBarfed
Why would someone replace a non-GC database engine with a database engine
that has GC?

Has Go evolved better low-GC features? As I understand Go GC vs JVM GC, Go
avoids major GC pauses by simply pushing the work into the future and
consuming memory more readily.

But a database is a long-running program, so you have to pay the piper
eventually.

~~~
FridgeSeal
I’d also be curious why they didn’t go with something like Foundation DB
either.

~~~
petermattis
Pebble and FoundationDB are apples and oranges. Pebble is a per-node KV
storage engine. FoundationDB is a distributed multi-model database.
Internally, FoundationDB uses a library like Pebble for the per-node data
storage. I think at one point it used SQLite. I'm not up to date on what it
currently uses. I seem to recall FoundationDB was writing their own
btree-based node-level storage engine to replace the usage of SQLite.

The equivalent of FoundationDB is present inside of CockroachDB: a
distributed, replicated, transactional, KV layer. This is where a big chunk of
CockroachDB value resides. This is where our use of Raft resides. Pebble lies
underneath this.

~~~
ryanworl
The current production storage engine is an old-ish version of the SQLite
btree. A new btree engine is being written now and is available but I don’t
know if it is being used in production anywhere.

RocksDB is shipping soon thanks to some work by members of the community.

------
02020202
hm, no word on performance comparison with badger, bolt, moss, pogreb,
pudge...

~~~
cristaloleg
Solving different problems, huh? But still similar.

------
nhumrich
> written in go

Why does the implementation language matter for a non-library tool? Is that
its only selling point?

~~~
johncolanduoni
It’s not a network-connected key-value store, so you need to interact with it
from Go. That makes a pretty big difference.

------
dfee
As a consumer, why would I want something like this written in Go vs. Rust?

Is it just that Rust is really good at developer relations? Because it feels
to me like all new foundational technology is safer and faster in a language
like Rust, and things written in Go should sit higher up the food chain.

~~~
dfee
I really don’t understand the downvotes. I’m not experienced with either
language - and this has nothing to do with a flame war.

The question, unstated and unopinionated AND intellectually honest, was: does
language impact community adoption, and if so, what are the drivers behind
it?

If I were going to write a foundational technology, I probably wouldn’t write
it in NodeJS, not that it couldn’t be done, but because I’d be concerned
mainstream adoption might suffer. For example, I’d expect a hypothetical JsSql
(a SQL engine written in JavaScript - assuming this doesn’t already exist)
would achieve lower general adoption than writing it in C++.

Get it?

~~~
brokencode
I don’t think Cockroach cares about adoption. This is not meant to be a
generally useful product in itself outside of their database. So the language
was chosen mainly based on their familiarity with Go and its ability to
integrate with their existing codebase.

I think their omission of major features such as transactions is more likely
to limit adoption than the language choice, so language choice is kind of
irrelevant from that perspective.

~~~
strken
The integration with their existing codebase might benefit from an
explanation. In Go, calls to C libraries are done using a compiler feature
called cgo, and there's a performance penalty that the cockroachdb team
measured at 171ns per call, plus you sometimes have to copy additional data
around for memory safety[0]. So far as I know, there's no way to avoid this
penalty that doesn't involve building a tool that does the same thing as cgo.

In this case, given that they already have a database written in Go, writing
the backing KV store library in Go has a clear performance benefit of 171ns
on every operation. This is above and beyond mere familiarity and easy
integration, although those are also important.

[0] [https://www.cockroachlabs.com/blog/the-cost-and-complexity-of-cgo/](https://www.cockroachlabs.com/blog/the-cost-and-complexity-of-cgo/)

