I wrote more about this on the FDB forums: https://forums.foundationdb.org/t/block-storage-full-fault-t...
I don't believe it would get you the locking you want (where multiple attempts to boot the same VM are mutually excluded at the block storage layer), but that's probably better done with etcd or something similar if you want it.
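To make the etcd-style idea concrete, here's a toy sketch of lease-based mutual exclusion. Everything here is hypothetical illustration: the in-memory `LeaseStore` class stands in for a real shared store like etcd, whose leases live on a separate quorum of machines and expire server-side.

```python
import time

class LeaseStore:
    """Toy in-memory stand-in for a shared KV store such as etcd."""

    def __init__(self):
        self._leases = {}  # key -> (owner, expiry_timestamp)

    def try_acquire(self, key, owner, ttl, now=None):
        """Grant the lease if it is free or expired; return True on success."""
        now = time.time() if now is None else now
        holder = self._leases.get(key)
        if holder is not None and holder[1] > now and holder[0] != owner:
            return False  # someone else holds a live lease
        self._leases[key] = (owner, now + ttl)  # acquire or renew
        return True

    def release(self, key, owner):
        """Drop the lease, but only if we still own it."""
        if key in self._leases and self._leases[key][0] == owner:
            del self._leases[key]
```

The idea is that a node acquires the lease on a volume before attaching it, and a second node's attempt fails until the first node releases the lease or stops renewing it and the TTL expires.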
Another huge payoff is the new "satellite" replication mode coming (I'm told) with FDB 6.0, which will give you "the best of both worlds" of synchronous and asynchronous replication across regions. A block device will usually let your legacy systems fail over across regions automatically without losing a single write.
Async replication is also available with rbd-mirror.
Upon further reading: I didn't realize that there is a TRIM command. I searched briefly and missed that. Neat.
Regarding locking, is there a distributed lock manager? What prevents two NBD clients from connecting to the same volume via two different nodes?
NBD is an extremely simple protocol. Read range, write range, delete range, sync -- that's it. If you want to implement an NBD server from scratch, you can totally do so in an afternoon. I have done this and use it in production: https://github.com/sandstorm-io/blackrock/blob/master/src/bl...
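To illustrate how little there is to it, here's a minimal sketch of the per-request dispatch an NBD server has to do, serving requests out of a bytearray. The command numbers are from the NBD protocol spec; the `MemoryBackend` class, its `handle` method, and the skipped socket/handshake plumbing are all my own simplifications, not code from the linked server.

```python
# Command numbers from the NBD protocol spec.
NBD_CMD_READ, NBD_CMD_WRITE, NBD_CMD_DISC, NBD_CMD_FLUSH, NBD_CMD_TRIM = range(5)

class MemoryBackend:
    """Serve NBD-style requests out of a bytearray (a stand-in for real storage)."""

    def __init__(self, size):
        self.data = bytearray(size)

    def handle(self, cmd, offset, length, payload=b""):
        """Process one request; return (error, data), mirroring NBD's simple reply."""
        if cmd == NBD_CMD_READ:
            return 0, bytes(self.data[offset:offset + length])
        if cmd == NBD_CMD_WRITE:
            self.data[offset:offset + length] = payload
            return 0, b""
        if cmd == NBD_CMD_TRIM:
            self.data[offset:offset + length] = bytes(length)  # "delete range"
            return 0, b""
        if cmd == NBD_CMD_FLUSH:
            return 0, b""  # nothing buffered in this toy backend
        return 22, b""  # EINVAL; NBD_CMD_DISC would just end the session
```

A real server wraps this loop in the wire format (a fixed-size request header in, a fixed-size reply header out), which is why you can get one going in an afternoon.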
iSCSI is comparatively far more complex. It's a TCP-based adaptation of the SCSI protocol, which has existed for decades as a way to talk to hard drives. As I understand it, you can pass arbitrary SCSI commands over iSCSI; see: https://en.wikipedia.org/wiki/SCSI_command iSCSI is enterprise-y and has a bigger ecosystem. You can netboot a diskless machine into Windows over iSCSI (I do this: http://kentonsprojects.blogspot.com/2011/12/lan-party-house-...).
Personally I like NBD a lot better because the simplicity means you can build new, cool things with it. But there are others who would say that NBD is a toy compared to iSCSI.
I would apologise for the code, but "how small can I make this" was sort of the point...
Just because `write()` returned successfully does not mean that the data has been written to disk (whether you're using NBD or otherwise). The application needs to call `fsync()` to force writes to disk and get confirmation of success. An `fsync()` will send all pending NBD_CMD_WRITEs followed by NBD_CMD_FLUSH and will only return success when all of these have completed successfully.
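Here's what that looks like from the application side: a sketch of a durable write, where the `fsync()` is the step that (on a filesystem backed by an NBD device) forces the pending NBD_CMD_WRITEs out and issues NBD_CMD_FLUSH. The `durable_write` helper is just a name I made up for illustration.

```python
import os

def durable_write(path, data):
    """Write data and return only once the kernel confirms it reached the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        view = memoryview(data)
        while view:                  # os.write() may accept fewer bytes than asked
            n = os.write(fd, view)   # data may still be sitting in the page cache
            view = view[n:]
        os.fsync(fd)                 # forces writes to the device and waits for the ack
    finally:
        os.close(fd)
```

Without the `fsync()`, a successful return from `write()` only means the kernel accepted the data, not that the storage did.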
Once you've got replication handled in the right place, the only thing dedicated RAID hardware is really giving you is hot-swap capability (which, yeah, you do still need).
What would work is mounting the same file system read-only from multiple machines.
For a production system, you do not want an NBD device backed by Python code on the same machine. That way lies nasty memory allocation deadlocks.