
Sled: Embedded Database Written in Rust - adamnemecek
https://github.com/spacejam/sled
======
PudgePacket
There was an awesome talk very recently about Sled at FOSDEM, highly
recommend.

[https://fosdem.org/2020/schedule/event/rust_techniques_sled/](https://fosdem.org/2020/schedule/event/rust_techniques_sled/)

Talk abstract:

sled and rio

modern database engineering with io_uring

sled is an embedded database that takes advantage of modern lock-free indexing
and flash-friendly storage. rio is a pure-rust io_uring library unlocking the
linux kernel's new asynchronous IO interface. This short talk will cover
techniques that have been used to take advantage of modern hardware and
kernels while optimizing for long term developer happiness in a complex,
correctness-critical Rust codebase.

------
dragonsh
Good to see some more activity in the embedded data space. Earlier we used BerkeleyDB
[1], mSQL [2] (from which MySQL was built), and Postgres 4.2 [3], and have now settled
on SQLite for all our embedded database needs. When we need performance, at present we
just use SQLite's in-memory db.

Will give Sled a try when it reaches 1.0.0, as at present SQLite serves our
embedded requirements.

[1] [https://launchpad.net/berkeley-db](https://launchpad.net/berkeley-db)

[2]
[https://hughes.com.au/products/msql/](https://hughes.com.au/products/msql/)

[3]
[https://dsf.berkeley.edu/postgres.html](https://dsf.berkeley.edu/postgres.html)

~~~
ComputerGuru
Make sure to check out LMDB if you haven't. It's not relational and has an
extremely anemic API, but it makes a great low-level building block. Sled is
similar in that regard, and very different from the others you have named.

~~~
willvarfar
I've seen some startling performance claims about LMDB, but don't know anyone
using it.

Why isn't LMDB used as an engine for e.g. MySQL? Why, for example, did
Facebook go with LevelDB in MyRocks instead of LMDB?

~~~
jnwatson
My company uses LMDB extensively. LevelDB is an LSM database. LSM is generally
better for write-heavy workloads than B-tree DBs like LMDB and sled. LMDB also
has a single-writer restriction.

------
jiggawatts
Apparently if you're on Windows you don't deserve performance:

    
    
    #[cfg(windows)]
    const MAX_THREADS: usize = 16;

    #[cfg(not(windows))]
    const MAX_THREADS: usize = 128;

From:
[https://github.com/spacejam/sled/blob/bf37bd120fbf62f78408e0...](https://github.com/spacejam/sled/blob/bf37bd120fbf62f78408e04970705ea57563acfa/src/threadpool.rs#L14-L18)

~~~
pilif
also great commit message:
[https://github.com/spacejam/sled/commit/b7a5d14399540daa433a...](https://github.com/spacejam/sled/commit/b7a5d14399540daa433a1c9352359e3664e41967)

The limit of 16 threads under Windows was added in a commit named "refactor
imports".

If somebody in the future tries to find out why the threads are limited to 16
under Windows, they will eventually end up at that commit, and if the committer
isn't around any more (or doesn't remember), nobody will ever know the
reasoning.

For projects that have even the slightest chance of staying around, be mindful
about your commit messages and explain _why_ you are doing something, not
_what_ you are doing - that's what the code itself is for.

~~~
hnarn
> be mindful about your commit messages and explain why you are doing
> something, not what you are doing

It's sad to see how many experienced programmers do not understand this. Of
course, it takes a little more effort, but without this type of thinking you
might as well skip the commit message completely. I am a very junior
programmer and it was one of the first things I was corrected for not doing,
and rightly so.

~~~
andy_ppp
Code for your future self in 6 months' time...

------
Ericson2314
I'd love to see someone go for an "exo-db" architecture, where the layers are
broken out into distinct crates. The expressiveness that makes this possible
should be a main benefit of Rust, on par with safety.

c.f.
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.435...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.435.6095&rep=rep1&type=pdf)

------
dang
A thread from 2018:
[https://news.ycombinator.com/item?id=17170733](https://news.ycombinator.com/item?id=17170733)

------
kstenerud
> If you want to store numerical keys in a way that will play nicely with
> sled's iterators and ordered operations, please remember to store your
> numerical items in big-endian form. Little endian (the default of many
> things) will often appear to be doing the right thing until you start
> working with more than 256 items (more than 1 byte), causing lexicographic
> ordering of the serialized bytes to diverge from the lexicographic ordering
> of their deserialized numerical form.

While I can understand that big endian ordering is convenient for size-
indifferent lexical ordering, is the tradeoff worth it? Yes, your benchmarks
will look better, but real-world usage will require clients to continually
byte swap. So you've basically forced a performance drop into user code. Also,
wouldn't size-aware compares perform better in the generated machine code
anyway?
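
Roughly, the byte swap the docs ask for is just this at the key boundary. A
quick sketch, assuming sled's open/insert/iter API and u64 keys ("example_db"
is just a placeholder path):

    // Store u64 keys big-endian so sled's byte ordering matches numeric ordering.
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let db = sled::open("example_db")?;

        for n in [1u64, 255, 256, 70_000].iter() {
            // to_be_bytes puts the most significant byte first, so the
            // lexicographic order of the bytes equals the numeric order of the keys.
            db.insert(n.to_be_bytes(), &b"payload"[..])?;
        }

        // Iterates back in numeric order: 1, 255, 256, 70000.
        for kv in db.iter() {
            let (k, _v) = kv?;
            let mut buf = [0u8; 8];
            buf.copy_from_slice(&k); // all keys above are exactly 8 bytes
            println!("{}", u64::from_be_bytes(buf));
        }
        Ok(())
    }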

~~~
rapsey
Byte-swapping 8 bytes is a minuscule cost compared to the cost of having a
badly formed index. There is really no way around storing the data big-endian.

~~~
krenoten
yeah, le-be conversions are not generally measurable above noise compared to
other database-related work.

sled stores arbitrary bytes. endianness is the concern of the person who wants
to store higher-level types than bytes, like integers. I do imagine having a
story for letting people deserialize/view bytes once, and having that view sit
in cache so that hot items are not repeatedly deserialized.

I agree with Adya's work "Fast key-value stores: An idea whose time has come
and gone"
[https://research.google/pubs/pub48030/](https://research.google/pubs/pub48030/),
which makes the case that stateful systems should not have to pay repeated
deserialization and network costs. I want sled to be well-situated for the
world that we're headed into, where this view will become more prominent. This
means caching deserialized data, better replication stories, and maybe nice
helpers for distributed database authors that allow for atomic tree splits /
merges and per-tree replication settings.

------
cft
Not sure if this is still relevant, but sled had a memory leak that precluded
us from using it for an actual production project 3 months ago:

[https://github.com/panicfarm/sltest](https://github.com/panicfarm/sltest)

~~~
shakna
The reproducible bug report is great, and I went looking to see if it might
have been fixed, but I didn't find the same bug in the list.

Did you let sled know about the bug? Can you point me in that direction?

~~~
cft
Yes, I wrote in sled's Discord on Nov 14 and got this from the author:

> hey @panicfarm, thanks for the reproduction code! I think this is the same
> issue that kerugma ran into and mentioned a couple days ago. I'm curious if it
> still happens on sled master? I just did a huge refactor and sometimes a bunch
> of issues just shake out. Anyway, I will turn this into a proper regression
> test and get the page_out method of PageCache to properly be called, which is
> the underlying root.

------
stbtrax
What kinds of systems is this used in? I'm assuming it's not for embedded
systems at the microcontroller level?

~~~
nemothekid
No, this would be used in places where you would use SQLite or RocksDB.

~~~
spectramax
Are there databases for embedded applications, such as ARM microcontrollers?

~~~
giancarlostoro
Probably SQLite, if not just flat files.

~~~
shakna
Putting SQLite onto a controller that isn't running a full OS is actually a
lot of work. [0] There isn't a working RTOS port out there, last I checked.

[0] [https://sqlite.org/custombuild.html](https://sqlite.org/custombuild.html)

~~~
giancarlostoro
Well yeah, it depends on the board's limitations.

------
Ericson2314
Why no Tree<K,V>? It would be very nice to say up front that one wants fixed-size
keys and values and get the benefits.

This whole hurr drrr arbitrary data k/v thing never made sense to me. Truly
unstructured data is meaningless - that's information theory for you - so why
not leverage some types?

~~~
krenoten
The best solution is to use a zero-copy format like flatbuffers, Google's
zerocopy library, or serde's borrow attributes
[https://serde.rs/lifetimes.html#borrowing-data-in-a-derived-...](https://serde.rs/lifetimes.html#borrowing-data-in-a-derived-impl)
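
For the serde route, a minimal sketch of what borrowing looks like (the Record
type and the bytes_from_db buffer are hypothetical; the crate used to
deserialize from the stored bytes is up to you):

    use serde::Deserialize;

    // With borrowing, `name` is a view into the byte buffer that came out of
    // the tree, not a fresh allocation.
    #[derive(Deserialize)]
    struct Record<'a> {
        #[serde(borrow)]
        name: &'a str,
        count: u64,
    }

    // e.g. let rec: Record = serde_json::from_slice(&bytes_from_db)?;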

You're totally right that data is meaningless without the ability to use it.
This is something I've been thinking about a lot, because it involves a lot of
trade-offs that definitely were not apparent to me before I started building
this.

Ultimately, I do want to make it easy to write a Tree<K, V> on top of sled,
but if I were to do it myself it would significantly reduce performance and
impose restrictions on users that don't feel appropriate to me.

* Being a lock-free B+ tree, there are a ton of cool techniques we can use on bytes that would not apply to arbitrary K types. For instance, all keys stored in a tree node are prefix encoded by the node's low key. This allows very long keys with common prefixes to be very cheaply stored, which is useful for things like F1-like embedded tables. We also do key truncation, where when we do a node split, we chop off the bytes necessary to actually differentiate between one side and the other, which further reduces the size of the index. Prefix encoding and suffix truncation are the bread and butter of modern B+ tree implementations, and that would hurt a lot to walk away from.

* All K and V types must then be serializable and deserializable. In rust we represent this with traits that come from specific external libraries. serde took up half of the sled compilation time when it was being used, so I don't want to pull that back in as a mandatory requirement for all users. I could have my own traits for this, and if I do anything like this, it will be the approach I take, and then external crates will implement serde-sled etc... but by depending on external crates, it makes for a very brittle API surface that requires all users to target the exact same version of that external trait. The core must be self-consistent and not export any external types that would cause sharp dependency issues.

* I like the idea of caching values that are deserialized only once, to avoid repeated deserialization costs on hot items. Something like this may be implemented in sled, but I haven't figured out the best way to represent it without causing memory issues etc... And in the mean time I'm more tempted to just point people to the 3 solutions above that let them view their bytes as structured data without many deserialization costs.
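
To make the trade-off concrete, the kind of Tree<K, V> layer I mean is roughly
this (a hypothetical sketch over sled's byte-oriented Tree; the Encode/Decode
traits are illustrative, not an existing sled API):

    use std::marker::PhantomData;

    // Illustrative traits; a real version would likely live in its own crate.
    trait Encode { fn encode(&self) -> Vec<u8>; }
    trait Decode: Sized { fn decode(bytes: &[u8]) -> Option<Self>; }

    // Order-preserving key encoding for u64: big-endian, as the docs recommend.
    impl Encode for u64 {
        fn encode(&self) -> Vec<u8> { self.to_be_bytes().to_vec() }
    }
    impl Decode for u64 {
        fn decode(bytes: &[u8]) -> Option<Self> {
            if bytes.len() != 8 { return None; }
            let mut buf = [0u8; 8];
            buf.copy_from_slice(bytes);
            Some(u64::from_be_bytes(buf))
        }
    }

    // Thin typed layer over the byte-oriented tree.
    struct TypedTree<K, V> {
        inner: sled::Tree,
        _marker: PhantomData<(K, V)>,
    }

    impl<K: Encode, V: Encode + Decode> TypedTree<K, V> {
        fn insert(&self, key: &K, value: &V) -> sled::Result<()> {
            self.inner.insert(key.encode(), value.encode()).map(|_| ())
        }
        fn get(&self, key: &K) -> sled::Result<Option<V>> {
            Ok(self.inner.get(key.encode())?.and_then(|ivec| V::decode(&ivec)))
        }
    }

Even a thin layer like this forces every user onto the same Encode/Decode
contract, which is the kind of brittleness mentioned above.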

~~~
Ericson2314
In the spirit of my other comment
[https://news.ycombinator.com/item?id=22375979](https://news.ycombinator.com/item?id=22375979)
it would be nice to see the layers broken out into separately usable
abstractions. Ideally, this means layering, K and V type params, _and_ no loss
of performance.

> Prefix encoding and suffix truncation

At the very least, you can do all this stuff in the case where the key is
[u8]/Vec<u8>. But maybe also [T]/Vec<T>? I love to look at my existing monomorphic
algorithms, and think "what traits would this require to make this as abstract
as possible"? Almost never is the answer "sorry, it cannot be generalized at
all".

> All K and V types must then be serializable and deserializable....and if I
> do anything like this, it will be the approach I take.....

I agree with defining your own traits. Per the above, your algorithms come
first, not other people's abstractions. But I don't know why you can't just do
that and ignore serde entirely. Serde becomes someone else's problem, and if
they don't want to bother with it, you will provide the [u8]/Vec<u8> instances.

> And in the mean time I'm more tempted to just point people to the 3
> solutions above that let them view their bytes as structured data without
> many deserialization costs.

Agreed. If your data structure has a pointer, that should be a foreign key in
my book (though not necessarily in the same direction!). It's best if you can
just memcpy the row, more or less.

[I say the pointer thing not only because of marshaling cost, but also because
of data modeling. Huge rows / JSON blobs just don't make sense to me. There
almost surely is some structure to the nested data, and that deserves to be
formalized and enforced in its own index. Ignore what they told you in SQL
class. Indexes express/reify invariants, and foreign keys act through
indices.]

