
Show HN: A proof-of-concept FoundationDB based network block device backend - dividuum
https://github.com/dividuum/fdb-nbd
======
voidmain
Awesome. I actually think that there is a lot of potential for using block
storage over FDB to bring extreme fault tolerance to legacy applications
without rearchitecting them. Because block devices are unshared (and you can
use FDB to ensure that robustly), conventional OS caching works well, and you
can get relatively normal performance. Of course, the FDB storage engine is not
currently well optimized for this use case, but since FDB is scalable you can
"just use more".

I wrote more about this on the FDB forums:
[https://forums.foundationdb.org/t/block-storage-full-fault-tolerance-for-legacy-applications/182](https://forums.foundationdb.org/t/block-storage-full-fault-tolerance-for-legacy-applications/182)

~~~
geofft
Would this provide additional benefits over Ceph, which does no-single-point-
of-failure networked block storage and is already a production-ready system?

It wouldn't get you the locking you want (where multiple attempts to boot the
VM are mutually excluded at the block storage layer), I believe, but I think
that's probably better done with etcd or something similar if you want that.

~~~
voidmain
Unless you have serious simulation or model checking tools, your ad hoc
combination of etcd and Ceph will almost certainly be buggy. I'm not 100%
positive it is even possible in the asynchronous model. And when you decide
you want, say, regional or rack-aware fault tolerance, you have _three_ more
distributed systems (Ceph, etcd, and your combination of them) that all have to
meet that goal. When you restore a backup of the contents of Ceph, what do you
do with the lock state in etcd? Do you know what happens when (say) etcd runs
out of memory? If you wanted to ship an on-premises version of your solution,
could you teach your customers to admin all these systems? FDB's mission is to
get under
_all_ of your state storage so that you only need _one_ system to do all this
operational stuff correctly. A block device is _most_ useful for people who
are trying to eliminate the last bits of state _not_ in FDB from their
infrastructure.

Another huge payoff is the new "satellite" replication mode coming (I'm told)
with FDB 6.0, which will give you "the best of both worlds" of synchronous and
asynchronous replication across regions. A block device will let your legacy
systems fail over across regions automatically, usually without losing a
single write.

~~~
emmericp
RBD (Ceph block devices) supports exclusive locking. The RBD kernel client has
had support for that since kernel 4.9. Older kernels can use rbd-nbd, the
user-space implementation of RBD.

Async replication is also available with rbd-mirror.

------
spullara
Cool! Here is my Java version that I wrote back in the day. Recently updated
to 5.1.

[https://github.com/spullara/nbd](https://github.com/spullara/nbd)

~~~
scoot
How do you handle resilience? If FDB is multi-node, do you run a copy on each
node? If so, how is failure detected and client traffic redirected, and what
protects against concurrent access from multiple clients to the same volume?

~~~
spullara
See [https://www.foundationdb.org](https://www.foundationdb.org) for
information on FDB, but in short it is a distributed database that is
auto-scaling, auto-healing, auto-sharding and ACID. You do need a minimum of 3
nodes to get those properties. When you connect to a volume, it is protected
by a lock that ensures only a single client can use it at a time.
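
For illustration only (not either project's actual code), a minimal sketch of
how such a single-client lock could be built on FDB transactions: the client
writes a random owner token under a lock key and refuses to attach if a
different token is already there. The key name and error handling are made up.

    import os
    import fdb

    fdb.api_version(510)
    db = fdb.open()  # uses the default cluster file

    LOCK_KEY = b'volume/myvol/lock'  # hypothetical key layout
    my_token = os.urandom(16)        # unique per client process

    @fdb.transactional
    def acquire_lock(tr):
        current = tr[LOCK_KEY]
        if current.present() and bytes(current) != my_token:
            raise RuntimeError('volume is already attached elsewhere')
        tr[LOCK_KEY] = my_token

    acquire_lock(db)

A real implementation would also need some kind of lease or heartbeat so the
lock doesn't stay held forever if the client crashes.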

~~~
scoot
Thanks for the reply. Don't know if you will see this, but I was also
interested in how the NBD client fails over to an alternate node if the one
it's connected to fails.

Regarding locking, is there a distributed lock manager? What prevents two NBD
clients from connecting to the same volume via two different nodes?

------
CaseFlatline
Could someone explain why/how NBD is better than just using a Linux host as an
iSCSI target? Googling NBD vs iSCSI turns up old articles with no real solid
conclusion.

~~~
kentonv
It's not really "better" or "worse".

NBD is an extremely simple protocol. Read range, write range, delete range,
sync -- that's it. If you want to implement an NBD server from scratch, you
can totally do so in an afternoon. I have done this and use it in production:
[https://github.com/sandstorm-io/blackrock/blob/master/src/blackrock/nbd-bridge.c++](https://github.com/sandstorm-io/blackrock/blob/master/src/blackrock/nbd-bridge.c++)
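
To give a feel for how small it is, here is a rough sketch (mine, not from the
repo above) of the transmission-phase loop of an NBD server in Python, backed
by an in-memory buffer. Negotiation/handshake and error handling are skipped,
so treat it as an illustration rather than a drop-in server.

    import socket
    import struct

    REQUEST_FMT = '>IHHQQI'        # magic, flags, type, handle, offset, length
    REQUEST_MAGIC = 0x25609513
    REPLY_MAGIC = 0x67446698
    CMD_READ, CMD_WRITE, CMD_DISC, CMD_FLUSH, CMD_TRIM = range(5)

    def serve(conn, size=64 * 1024 * 1024):
        disk = bytearray(size)     # toy backend: one big in-memory buffer
        while True:
            req = conn.recv(struct.calcsize(REQUEST_FMT), socket.MSG_WAITALL)
            magic, _flags, cmd, handle, offset, length = struct.unpack(REQUEST_FMT, req)
            assert magic == REQUEST_MAGIC
            if cmd == CMD_READ:
                conn.sendall(struct.pack('>IIQ', REPLY_MAGIC, 0, handle)
                             + bytes(disk[offset:offset + length]))
            elif cmd == CMD_WRITE:
                disk[offset:offset + length] = conn.recv(length, socket.MSG_WAITALL)
                conn.sendall(struct.pack('>IIQ', REPLY_MAGIC, 0, handle))
            elif cmd in (CMD_FLUSH, CMD_TRIM):
                # nothing to persist for an in-memory toy; just acknowledge
                conn.sendall(struct.pack('>IIQ', REPLY_MAGIC, 0, handle))
            elif cmd == CMD_DISC:
                conn.close()
                return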

iSCSI is comparatively far more complex. It's a TCP-based adaptation of the SCSI
protocol, which has existed for decades as a way to talk to hard drives. As I
understand it, you can pass arbitrary SCSI commands over iSCSI; see:
[https://en.wikipedia.org/wiki/SCSI_command](https://en.wikipedia.org/wiki/SCSI_command)
iSCSI is enterprise-y and has a bigger ecosystem. You can netboot a diskless
machine into Windows over iSCSI (I do this:
[http://kentonsprojects.blogspot.com/2011/12/lan-party-house-technical-design-and.html](http://kentonsprojects.blogspot.com/2011/12/lan-party-house-technical-design-and.html)).

Personally I like NBD a lot better because the simplicity means you can build
new, cool things with it. But there are others who would say that NBD is a toy
compared to iSCSI.

~~~
deforciant
Did you encounter any problems with NBD caching, where it acknowledges the
write to the application but doesn't pass it to your "backend", leaving no room
for error handling if that backend goes away?

~~~
kentonv
NBD provides a virtual block device, so all the normal filesystem caching the
kernel does above a hard drive applies to NBD as well. This is good: this is
what makes it so fast.

Just because `write()` returned successfully does _not_ mean that the data has
been written to disk (whether you're using NBD or otherwise). The application
needs to call `fsync()` to force writes to disk and get confirmation of
success. An `fsync()` will send all pending NBD_CMD_WRITEs followed by
NBD_CMD_FLUSH and will only return success when all of these have completed
successfully.
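
In other words, an application that needs durability on an NBD-backed
filesystem does the same thing it would do on a local disk, roughly (path made
up for the example):

    import os

    fd = os.open('/mnt/nbd0/journal.log', os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    os.write(fd, b'record\n')  # may still be sitting in the page cache
    os.fsync(fd)               # sends the pending NBD_CMD_WRITEs plus NBD_CMD_FLUSH;
                               # durable once this returns successfully
    os.close(fd)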

------
oneplane
What makes the use of FoundationDB special in this case?

~~~
haneefmubarak
I think the biggest feature is that you get (up to) linearizable consistency
while still being FAST, meaning you can rest assured that any writes are
actually permanent, even if a server or two fails (assuming your replication
factor is set appropriately).

------
ryanworl
Could you mount this on multiple machines and do simultaneous reads and
writes?

~~~
dividuum
By default, no: it's a block device, so it has no idea what the file system on
top is doing. Mounting the same file system in rw mode twice would result in
instant corruption unless the file system is built for that, and I'm not sure
such a file system exists.

What would work is mounting the same file system read only from multiple
machines.

~~~
ryanworl
I ask because it would be nice to run a traditional database this way for HA.
I suppose nearly as good would be the ability to instantly mount and unmount
across multiple machines, if you could also use FoundationDB to fence writes to
the device in case the old writer comes back as a zombie.

~~~
spullara
You can definitely do this.
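
A sketch of what that fencing could look like with the FDB Python bindings
(key layout and names invented for illustration): every write transaction
re-checks the owner key, so a zombie that has lost ownership can no longer
commit anything.

    import fdb

    fdb.api_version(510)
    db = fdb.open()

    OWNER_KEY = b'volume/myvol/owner'  # hypothetical key, same idea as the attach lock

    @fdb.transactional
    def fenced_write(tr, my_token, block_key, block_data):
        owner = tr[OWNER_KEY]
        if not owner.present() or bytes(owner) != my_token:
            raise RuntimeError('fenced: another writer owns this volume')
        tr[block_key] = block_data  # committed atomically with the ownership check

A new writer takes over by replacing the owner token; any transaction the old
writer tries to commit afterwards fails the check and changes nothing.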

------
0x006A
Does the zlib compression in Python make it faster or slower?

~~~
dividuum
Good question. I actually didn't test that at all. I mainly added it because I
saw a few mostly empty blocks being written, and compression lowered the total
space usage a bit.

~~~
stock_toaster
You might give lz4 a try as well. ZFS uses it as a compression option, and the
overhead is quite small (especially compared to zlib).
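
Both are easy to benchmark from Python if anyone wants numbers; something like
the following, with BLOCK standing in for one of the device's blocks (assumed
here to be 4 KiB and mostly empty):

    import os
    import time
    import zlib
    import lz4.frame  # pip install lz4

    # stand-in for one mostly-empty 4 KiB block
    BLOCK = os.urandom(256) + b'\x00' * (4096 - 256)

    for name, compress in [('zlib', zlib.compress), ('lz4', lz4.frame.compress)]:
        start = time.perf_counter()
        for _ in range(10000):
            out = compress(BLOCK)
        elapsed = time.perf_counter() - start
        print('%s: %d bytes, %.3f s for 10k blocks' % (name, len(out), elapsed))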

~~~
voidmain
A sophisticated approach to block compression would be to build a dictionary
using something like zstandard's "training mode" and a random sample of data
blocks, store it (versioned) in FDB as well and reference the dictionary
version from compressed blocks. That would get better compression with the
relatively small blocks.
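
With the zstandard Python package that could look roughly like the sketch
below; read_block and random_sample_of_block_ids are hypothetical helpers, and
the dictionary size is an arbitrary choice:

    import zstandard

    # train a shared dictionary on a random sample of existing blocks
    samples = [read_block(i) for i in random_sample_of_block_ids()]  # hypothetical helpers
    cdict = zstandard.train_dictionary(64 * 1024, samples)

    # store cdict.as_bytes() under a versioned key in FDB (e.g. ('dict', 1)) and
    # record that version number alongside every block compressed with it
    compressor = zstandard.ZstdCompressor(dict_data=cdict)
    decompressor = zstandard.ZstdDecompressor(dict_data=cdict)

    compressed = compressor.compress(read_block(0))
    original = decompressor.decompress(compressed)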

------
amluto
Very cute.

For a production system, you do not want an NBD device backed by Python code
on the same machine. That way lie nasty memory-allocation deadlocks.

~~~
btown
Why would this be a bad thing? The Python code would be loaded from a
traditional volume.

~~~
amluto
Suppose you've filled up most of your memory and that some of that memory
contains data that you've written to an NBD-backed device but that hasn't been
flushed out yet. The kernel may try to free memory by writing out some of that
dirty data. If it were a real _remote_ NBD-backed device, all kinds of fancy
logic in the kernel would kick in to get the data written to the network
without problems. Because it's actually local, that logic won't do much.
Instead, the kernel will successfully deliver the write requests to the Python
interpreter's socket. Now the Python script will have to do its own writeback
logic while the system is still potentially out of memory. At the very least,
this will involve calls to the FoundationDB client library. Since the script
is in Python, it's extremely likely that it will also try to allocate some
memory to store temporary objects used by the Python code. If that allocation
ends up asking the kernel for more memory, then you either deadlock (memory
reclaim is waiting for Python, which is in turn waiting for memory) or you
trigger an OOM condition. If the latter happens and the Python interpreter
gets killed, then you lose data.

