Hacker News new | past | comments | ask | show | jobs | submit login
An embedded database written in Rust (github.com)
277 points by mountainview on May 28, 2018 | hide | past | web | favorite | 72 comments

Other people have had trouble wringing competitive performance out of Bw-Trees despite heroic optimization efforts [0]. Why is this implementation going to beat other index structures with just a bit of tuning?

[0] https://news.ycombinator.com/item?id=17041616

It might not. But the critiques of bw trees in terms of performance that I've seen have not had compelling data in terms of things that matter outside of academia or benchmarking shootouts, like write or space amplification. The bw tree is a cheap thing to abandon after I implement a persistent ART and measure it though. The bwtree is only like 1k of rust on top of the modular pagecache, which is the real heart of the system.

I am not familiar with bw-trees, but are you aware of this paper from this year’s SIGMOD?


Moving to ARTs (possibly combined with B+ trees) might be a smart choice after all.

Edit: sorry, I missed that grand-parent has cited the same paper already.

That paper is pretty good but it's comparing bw-tree with much simpler in memory data structures. I think bw-tress might work specially well for fast-disk storage.

This was my interpretation as well. I'm going to compare a disk-backed bwtree with a disk-backed ART, both backed by the same pagecache, and maybe end up with an ART that scatters partial pages on disk, bwtree style. But I need to measure apples to apples on the metrics that matter for storage first. The pagecache is where most of the complexity is in my implementation, and it makes building different kinds of persistent structures on top of it pretty easy. docs.rs/pagecache

>I think bw-tress might work specially well for fast-disk storage.

That's precisely the claim made in the original paper describing Bw-trees.


Nice with more developments in embedded databases!

Would be interesting with a comparison with mentat [1]. Do you plan to support any query langauge such as datalog, like mentat?

[1] https://github.com/mozilla/mentat

Yeah, I'm curious about using sled as a more ssd friendly storage engine for mentat. I'm just starting to experiment with datalog implementations, but I think by having harmony between the storage engine, query language, and hardware properties we can make a really compelling stateful systems. If this is something that interests you, I'd love to work with more people on this.

Sounds very exciting!

Sounds exciting!

"don't wake up operators. bring reliability techniques from academia into real-world practice."

What does he mean here? Don't wake up people with operational problems? Or does "wake up" refer to a scheduling strategy?

Either way, what are these techniques?

It means pay more attention to reliability than pop infrastructure and internet companies (who can offset poor reliability with human attention or intentionally deprioritize it to sell more support contacts) tend to put into these things. Specifically, exhaustive concurrency testing of lock-free algorithm interleavings via ptrace driven scheduling, model-based testing in combination with fault injection, ALICE-style file correctness testing, and for the various distributed modules that sit on top, network simulation combined with lineage driven fault injection. This is all very much a work in progress, and I'd love to work with more people on it!

I took that as a reference to pager duty.

How well does that handle the storage disappearing halfway through a write? How well does it handle power being cut halfway through an update of some sort? How well does it handle some of the written blocks actually making it to the disc and others not? How about if the ones that made it were not the first or the last it issued to be written?

(For predictable answers to these, and many other complex questions, when dealing with data you care about, use sqlite)

ALICE showed that's not always true with sqlite. Sled is being built with an extreme bias toward reliability over features, but as the readme says, it has some time to go before reaching maturity. The tests are quite good at finding new issues and deterministically replaying them, so you can help bake it in by mining bugs using the default test suite and help it get there.

Most database designers assume that a power failure will only affect writes that are pending. Alas, for SSDs and NVMEs that's not always true. A power failure can cause all kinds of corruption. Long story short: even append-only strategies will not save you.


Indeed. This is why I aggressively checksum everything and pay particular attention to throwing away all data that was written after any detected corruption during recovery. This is easier with the log-only architecture. It's also totally os and filesystem agnostic. I was happily surprised yesterday when it passed tests on fuchsia :]

So you might end up throwing away everything. (I know, it's not your fault)

Users can rely on sequential recovery. At some point I'll probably write a partial recovery tool that gives you all versions of all keys that are at all present anywhere in the readable file though, which won't be much work. Typical best practices encourage moving away from single disk reliance for particularly valuable data, but this library will also work on phones etc... So it is important to support people when a wide variety of things go wrong.

That paper sounds like problems that can not be worked around in software and need hardware fixes?

You could work around it (erasure coding, fe), if you had insights on the failure mode specifics, but it's vendor specific and vendors are not exactly forthcoming.

So the only thing you can do is add checksumming schemes that allow you to detect you have been affected.

The paper is from 2013, so the situation might have improved meanwhile (I wouldn't put any money on it)

>How well does that handle the storage disappearing halfway through a write? How well does it handle power being cut halfway through an update of some sort?

Or the drive catching fire and the user manually deleting all existing backups.

I see that MVCC is a planned feature. Why would you need MVCC for an embedded database? It seems like unnecessary overhead that conflicts with the performance goals.

It's not needed for the single-key atomic record store, which is the sled bwtree index that is the current highest level module. MVCC is implemented in most popular embedded DBs because it is an effective way to manage mixed workloads that seek to read snapshots of the entire database at a single point in time as well as not blocking writes as this happens. This functionality is desirable for transactions that support mixed workloads. That's why I'm building it for a higher level module. This is a collection of modules that let you choose the abstraction and associated complexity that you want. There's also a modular pagecache that is totally decoupled and reusable for your own database experiments.

MVCC is a viable general purpose approach to deal with concurrency. It is the mother of all DAGs.

I think they mean "embedded" as in "embedded in a program", not "targeting embedded hardware". In particular the README states they intend to use it in https://github.com/spacejam/rasputin. This is a use case similar to LMDB, for instance, which has MVCC.

In Rust abstractions are free.

Some are (more or less) free, others are not. Rust simply gives you some great tools to make the choices yourself. The biggest win tends to be that you can get away without a GC, but still keep the guarantee that you're not leaking memory by forgetting to collect it.

Rust doesn't guarantee freedom of leaks, as leaking is memory-safe.

To be fair, I think the way GP worded it might still be accurate; they said that you wouldn't leak by forgetting to collect it. I took that as a statement about not needing to manually free things than an assertion that Rust guarantees no memory leaks (which can still occur from things like reference cycles).

You’re right, I might be overly sensitive to people incorrectly making this claim directly.

I too try (and sometimes fail) to be careful when talking about this. People get all hot and bothered when you go overboard :P It can be tempting to start explaining what an afine type system is, but that can hurt more than help.


If that was so, there wouldn't have been a need for the unsafe mode.

Unsafe does not exist for performance.

Quoting from the rust book:

> Unsafe Rust exists because, by nature, static analysis is conservative. When the compiler is trying to determine if code upholds the guarantees or not, it’s better for it to reject some programs that are valid than accept some programs that are invalid. That inevitably means there are some times when your code might be okay, but Rust thinks it’s not! In these cases, you can use unsafe code to tell the compiler, “trust me, I know what I’m doing.” The downside is that you’re on your own; if you get unsafe code wrong, problems due to memory unsafety, like null pointer dereferencing, can occur.

> There’s another reason Rust has an unsafe alter ego: the underlying hardware of computers is inherently not safe. If Rust didn’t let you do unsafe operations, there would be some tasks that you simply could not do. Rust needs to allow you to do low-level systems programming like directly interacting with your operating system, or even writing your own operating system! That’s one of the goals of the language. Let’s see what you can do with unsafe Rust, and how to do it.

As the person who wrote those paragraphs, yes. The fundamental reason unsafe exists is not for performance reasons. That being said, the parent is also true that sometimes, unsafe is used to enhance performance. But that's not the usual case, and in fact, can even hurt performance in ways. For example, &mut T can't alias, but *mut T can, and so &mut T can be optimized more aggressively. (We recently turned these optimizations back on: https://github.com/rust-lang/rust/pull/50744 )

You're argument doesn't really make sense. Not all abstractions are workarounds for "unsafe" things. For example, unsafe is needed for dealing with the OS, no amount of abstraction will help you there. At some point you must talk to and accept memory that another process (OS) tells you is correct. I'm not so knowledgeable in Rust so I can't say definitely, but I think you should back up your statement with some (any) examples.

Do you realize "zero-cost" abstractions are not specific to Rust?

Your parent didn't say that. Speaking about Rust in a thread about a Rust project seems to be very on-topic?

(I would also disagree with that statement, but in a different way: Rust enables zero-cost abstractions, but nothing inherently means they have to be. I can also make quite costly ones, and they'll compile.)

> Your parent didn't say that.

Yes and it's more disturbing. It says "In Rust abstractions are free" when an algorithm was discussed.

Rust having "free" abstractions does not change the performance characteristics of an algorithm: this is what the grand-parent was talking about.

Yes, I certainly agree with that distinction. If you'd have said that in the first place, I wouldn't have said anything :)

What is results disturbing is that you are arguing just for the sake of arguing. "Algorithm" is a too wide term - everything is an algorithm, especially in programming. Talk was about abstractions, now you're trying to say it was about algorithms. We need "ignore" button here.

I would answer "you can write trash in any language, in Rust you need to find ways how to do it, when in other languages - how to avoid it". Looks like ycombinator now is the place when people just looking for negativity and what else ti disagree with.

How does it compare relatively to dgraph's Badger library?

Considering this has not even exited alpha, it stands to reason that Badger is a fair bit more robust in performance and failure modes.

There is zero reason behind this assumption.

Does rust compile reliably to embedded targets yet? Last time I checked there were a lot of problems with armv5.

I'm not sure about ARMv5, but v7 and v8-A are Tier 1 for Firefox, so we get a pretty decent smoke test out of them. The thing about embedded is that it's quite diverse, so speaking about it in broad terms is tough. It goes from "lol nope" to "barely works" to "pretty decent" to "great", depending on which thing you're talking about.

We have a whole working group this year working on embedded.

quoting catwell:

> I think they mean "embedded" as in "embedded in a program", not "targeting embedded hardware".

I've been using embedded Rust for my thesis project (ARMv6-M), and it's been a pleasure.

It's been a focus of the team, and been getting better. With efforts being made to move some external tools into the mainline toolchains. Last I checked it was possible, but still a bit cumbersome.

P.S. Really looking forward to writing Rust on AVRs.

I'll take Richard Hipp's C code over just about anyone's Rust code any day of the week. The man is a national treasure!

I'll also take any project that has thousands (if not millions?) of man-hours together with an extremely solid test suite. That doesn't mean other projects aren't worth exploring or won't be useful.

It says it is alpha about 5 times in the readme. What gave you the impression that this was already more robust than SQLite?

But C is very unsafe! /s

sorry you got strikefrced, but love your other comments

No clustering :( I feel like good multi-master asynchronous and synchronous clustering is truly the frontier in DBs.

clustering and embedded seem like almost opposite ends of the spectrum

Then again embedded databases are a great building block for things like etcd. It also might even make sense to have something like this in process because that would remove quite a few failure modes coming from the client connection and simplify deployment

You can't cluster an embedded database, this comment makes not sense.

Compare with SQLite, not with Orcale or Postgres.

Currently it's even more basic. The current usable parts are a pagecache following the llama approach, some great testing utility libraries, and an index (that you can use as a kv) that follows the bwtree approach. Later it will have structured access support, but it needs some more db components to get there. It is a construction kit as well as a kv.

So... SQLite FoundationDB?

Not really, when the embedded database is allowed to run its own threads and network connections embedding something like etcd makes perfect sense. It removes the failure modes of the client connection and simplifies deployment.

What about TiKV (https://github.com/pingcap/tikv), a distributed transactional Key-Value store?

Use it for large scale HTAP! It's great for its flexible use cases at high scales :)

What would be the use-case for multi-master clustering in embedded?

"real" serverless - mesh databases! ;p

This is honestly a use case I'm experimenting with using a mix of CRDTs and OT. Our systems are becoming more and more location agnostic and I don't feel that our current data infrastructure is adequate to serve the workloads we're going to be facing as compute migrates to the edge.

It's modular, and there is a paxos implementation, but it has been built totally in simulation so far and I haven't plugged it into an io layer yet. But this is trivial. That said, sled will always be a bwtree index, and the other modular crates will stand on their own.

what do you think of FoundationDb?

I am a total devotee to their approach to building simulable systems, although I seek to push it even farther and integrate lineage driven fault injection from an early stage. I see a lot of cool things in what they have done, technically. Sled is free from day one.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact