Hacker News
Show HN: A proof-of-concept FoundationDB based network block device backend (github.com)
192 points by dividuum 7 months ago | 33 comments

Awesome. I actually think that there is a lot of potential for using block storage over FDB to bring extreme fault tolerance to legacy applications without rearchitecting them. Because block devices are unshared (and you can use FDB to ensure that robustly), conventional OS caching works well, and you can get relatively normal performance. Of course, the FDB storage engine is currently not very well optimized for this use case, but since FDB is scalable you can "just use more".

I wrote more about this on the FDB forums: https://forums.foundationdb.org/t/block-storage-full-fault-t...

Would this provide additional benefits over Ceph, which does no-single-point-of-failure networked block storage and is already a production-ready system?

It wouldn't get you the locking you want (ensuring that multiple attempts to boot the same VM are mutually excluded at the block storage layer), I believe, but I think that's probably better done with etcd or something similar if you want that.

Unless you have serious simulation or model checking tools, your ad hoc combination of etcd and ceph will almost certainly be buggy. I'm not 100% positive it is even possible in the asynchronous model. And when you decide you want, say, regional or rack-aware fault tolerance, you have three more distributed systems (ceph, etcd, and your combination) to make meet that goal. When you restore a backup of the contents of ceph, what do you do with the lock state in etcd? Do you know what happens when (say) etcd runs out of memory? If you wanted to ship an on-premises version of your solution, can you teach your customers to admin all these systems? FDB's mission is to get under all of your state storage so that you only need one system to do all this operational stuff correctly. A block device is most useful for people who are trying to eliminate the last bits of state not in FDB from their infrastructure.

Another huge payoff is the new "satellite" replication mode coming (I'm told) with FDB 6.0, which will give you "the best of both worlds" of synchronous and asynchronous replication across regions. A block device will, in most cases, let your legacy systems fail over across regions automatically without losing a single write.

RBD (Ceph block devices) supports exclusive locking. The RBD kernel client has supported that since kernel 4.9. Older kernels can use rbd-nbd to use the user-space implementation of RBD.

Async replication is also available with rbd-mirror.

Cool! Here is my Java version that I wrote back in the day, recently updated to 5.1.


Nice. That looks way more polished. Thanks for linking!

Upon further reading: I didn't realize that there is a TRIM command. I searched briefly and missed it. Neat.

How do you handle resilience? If FDB is multi-node, do you run a copy on each node? If so, how is failure detected / client traffic redirected, and what protects against concurrent access from multiple clients to the same volume?

See https://www.foundationdb.org for information on FDB, but in short: it is a distributed database that is auto-scaling, auto-healing, auto-sharding, and ACID. You do need a minimum of 3 nodes to get those properties. When you connect to a volume, it is protected by a lock that ensures only a single client is attached.
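The single-client lock can be thought of as a lease: the connected client keeps refreshing an owner key transactionally, and a newcomer may only take over once the lease has expired. Here is a minimal sketch of that logic — the key layout, lease length, and the plain dict standing in for FDB's transactional key-value store are all made up for illustration; a real implementation would run each read-check-write below inside a single FDB transaction:

```python
import time

LEASE_SECONDS = 10  # hypothetical lease length

# A plain dict stands in for FDB's key-value store here.
store = {}

def try_acquire(volume, client_id, now=None):
    """Take the volume lock if it is free or its lease has expired."""
    now = time.time() if now is None else now
    owner, expires = store.get(("lock", volume), (None, 0))
    if owner not in (None, client_id) and expires > now:
        return False  # someone else holds a live lease
    store[("lock", volume)] = (client_id, now + LEASE_SECONDS)
    return True

def refresh(volume, client_id, now=None):
    """Extend the lease; fails if we lost the lock in the meantime."""
    now = time.time() if now is None else now
    owner, _ = store.get(("lock", volume), (None, 0))
    if owner != client_id:
        return False
    store[("lock", volume)] = (client_id, now + LEASE_SECONDS)
    return True
```

Because FDB transactions are serializable, two clients racing on the acquire step could not both observe the lock as free — which is exactly the property an etcd-plus-ceph combination has to build by hand.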

Thanks for the reply. Don't know if you will see this, but I was also interested in how the NBD client fails over to an alternate node if the one it's connected to fails.

Regarding locking, is there a distributed lock manager? What prevents two NBD clients connecting to the same volume via two different nodes?

Could someone explain why/how NBD is better than just using a Linux host as an iSCSI target? Googling NBD vs. iSCSI shows old articles with no real solid conclusion.

It's not really "better" or "worse".

NBD is an extremely simple protocol. Read range, write range, delete range, sync -- that's it. If you want to implement an NBD server from scratch, you can totally do so in an afternoon. I have done this and use it in production: https://github.com/sandstorm-io/blackrock/blob/master/src/bl...
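To show how little those four operations amount to, here is a toy backend implementing their semantics against an in-memory store (this is only the command semantics — not the actual NBD wire format, handshake, or error handling; the class name is made up):

```python
class ToyBlockBackend:
    """Toy backend implementing the four NBD-style operations."""

    def __init__(self, size):
        self.data = bytearray(size)  # all zeroes initially

    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])

    def write(self, offset, payload):
        self.data[offset:offset + len(payload)] = payload

    def trim(self, offset, length):
        # Deleting a range just zeroes it from the client's view.
        self.data[offset:offset + length] = bytes(length)

    def flush(self):
        # Purely in-memory store: nothing is buffered, so this is a no-op.
        pass
```

A real server wraps these in a request loop that reads fixed-size request headers off a socket and writes replies back — which is why a from-scratch implementation really is an afternoon project.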

iSCSI is comparatively far more complex. It's a TCP-based adaptation of the SCSI protocol, which has existed for decades as a way to talk to hard drives. As I understand it, you can pass arbitrary SCSI commands over iSCSI; see: https://en.wikipedia.org/wiki/SCSI_command iSCSI is enterprise-y and has a bigger ecosystem. You can netboot a diskless machine into Windows over iSCSI (I do this: http://kentonsprojects.blogspot.com/2011/12/lan-party-house-...).

Personally I like NBD a lot better because the simplicity means you can build new, cool things with it. But there are others who would say that NBD is a toy compared to iSCSI.

You might enjoy a bunch of toys I wrote to play around with NBD a little while ago: https://github.com/regularfry/tinynbd

I would apologise for the code, but "how small can I make this" was sort of the point...

Did you encounter any problems with NBD caching, where it acknowledges the write to the application but doesn't pass it to your "backend", leaving no room for error handling if that backend goes away?

NBD provides a virtual block device, so all the normal filesystem caching the kernel does above a hard drive applies to NBD as well. This is good: this is what makes it so fast.

Just because `write()` returned successfully does not mean that the data has been written to disk (whether you're using NBD or otherwise). The application needs to call `fsync()` to force writes to disk and get confirmation of success. An `fsync()` will send all pending NBD_CMD_WRITEs followed by NBD_CMD_FLUSH and will only return success when all of these have completed successfully.

Gut feeling: NBD vs. iSCSI is like NFS vs. SMB, it's not that it has magical features, but it is more integrated and more purpose-built, and built-in (as in, in the kernel).

Development-wise it's a much simpler protocol. iSCSI has a lot of its own complexity plus the SCSI complexity to implement; NBD has a reasonably short RFC-style document.

What makes the use of FoundationDB special in this case?

I think the biggest feature is that you get (up to) linearizable consistency while still being FAST, meaning you can rest assured that any writes are actually permanent, even if a server or two fails (assuming your replication factor is set appropriately).

It's not FoundationDB per se that makes this special. It's the ability to throw out hardware RAID and remove a layer of Expensive Things That Go Wrong At The Worst Possible Moment.

Once you've got replication handled in the right place, the only thing dedicated RAID hardware is really giving you is hot-swap capability (which, yeah, you do still need).

Could you mount this on multiple machines and do simultaneous reads and writes?

By default no: It's a block device, so it has no idea what the file system on top is doing. Mounting the same file system in rw mode twice would result in instant corruption unless the file system is built for that. I'm not sure such a file system exists.

What would work is mounting the same file system read only from multiple machines.

You can't run a regular file system like ext4 or xfs but there are file systems designed for sharing a block device: https://en.wikipedia.org/wiki/Clustered_file_system#SHARED-D...

I ask because it would be nice to run a traditional database this way for HA. I suppose nearly as good is being able to instantly mount and unmount across multiple machines if you could also use FoundationDB to fence writes to the device if the old writer comes back as a zombie.

You can definitely do this.

VMware's file system, VMFS, is designed to do this: https://wikipedia.org/wiki/VMware_VMFS

NBD does not support that. It is a networked block device, and the kernel and file systems on top of it do lots of caching. So other clients will not see the latest version if one of them changes a block.

Does the zlib compression in Python make it faster or slower?

Good question. I actually didn't test that at all. I mostly added that as I saw a few mostly empty blocks written and compression lowered the total space usage a bit.

You might give lz4 a try as well. ZFS uses it as a compression option, and the overhead is quite small (especially compared to zlib).
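lz4 itself lives in a third-party package, but the speed/ratio trade-off is easy to see even within zlib: level 1 is much cheaper than level 9, and on the mostly-empty blocks described above any level collapses the data dramatically. A quick sketch (the sample block contents are made up):

```python
import zlib

# A mostly-empty 4 KiB "block" like the ones described above.
block = b"some filesystem metadata" + bytes(4096 - 24)

for level in (1, 6, 9):
    compressed = zlib.compress(block, level)
    print("level", level, "->", len(compressed), "bytes")

# Even level 1 shrinks a mostly-zero block to a tiny fraction of its size.
```

For latency-sensitive block I/O, the low compression levels (or lz4, which is faster still) are usually the right end of that curve.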

A sophisticated approach to block compression would be to build a dictionary using something like zstandard's "training mode" and a random sample of data blocks, store it (versioned) in FDB as well and reference the dictionary version from compressed blocks. That would get better compression with the relatively small blocks.
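zstandard's training mode isn't in the standard library, but the versioned-dictionary idea above can be sketched with zlib's preset-dictionary support: build a dictionary from sample blocks, store it under a version key, and record that version alongside each compressed block. Everything here is illustrative — the plain dict stands in for FDB, and concatenating samples is the crudest possible substitute for a real trained dictionary:

```python
import zlib

# Stand-in for FDB: dictionaries stored under a version key.
dict_store = {}

def store_dictionary(version, sample_blocks):
    # Real deployments would use something like zstandard's
    # train_dictionary(); concatenating samples is a crude stand-in.
    # zlib only uses the last 32 KiB of a preset dictionary.
    dict_store[version] = b"".join(sample_blocks)[-32768:]

def compress_block(block, version):
    c = zlib.compressobj(zdict=dict_store[version])
    return version, c.compress(block) + c.flush()

def decompress_block(version, payload):
    d = zlib.decompressobj(zdict=dict_store[version])
    return d.decompress(payload) + d.flush()
```

Storing the dictionary versions in FDB alongside the blocks keeps compression state and data in one transactional system, so a restore can never separate a block from the dictionary it was compressed against.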

Very cute.

For a production system, you do not want an NBD device backed by Python code on the same machine. That way lie nasty memory-allocation deadlocks.

Why would this be a bad thing? The Python code would be loaded from a traditional volume.

Suppose you've filled up most of your memory and that some of that memory contains data that you've written to an NBD-backed device but that hasn't been flushed out yet. The kernel may try to free memory by writing out some of that dirty data. If it were a real remote NBD-backed device, all kinds of fancy logic in the kernel would kick in to get the data written to the network without problems. Because it's actually local, that logic won't do much. Instead, the kernel will successfully deliver the write requests to the Python interpreter's socket. Now the Python script will have to do its own writeback logic while the system is still potentially out of memory. At the very least, this will involve calls to the FoundationDB client library. Since the script is in Python, it's extremely likely that it will also try to allocate some memory to store temporary objects used by the Python code. If that allocation ends up asking the kernel for more memory, then you either deadlock (memory reclaim is waiting for Python, which is in turn waiting for memory) or you trigger an OOM condition. If the latter happens and the Python interpreter gets killed, then you lose data.
