
It's hard to take this seriously: storage is an excruciatingly hard problem, yet this cheerful description of a nascent and aspirational effort seems blissfully unaware of how difficult it is to even just reliably get bits to and from stable storage, let alone string that into a distributed system that must make CAP tradeoffs. There is not so much as a whisper about what the data path actually looks like, other than "the design includes the ability to support [...] Reed-Solomon error correction in the near future" -- and the fact that such an empty system hails itself as pioneering an unsolved problem in storage is galling in its ignorance of prior work (much of it open source).

Take it from someone who has been involved in both highly durable local filesystems[1] and highly available object storage systems[2][3]: this is such a hard, nasty problem with so many dark, hidden and dire failure modes, that it takes years of running in production to get these systems to the level of reliability and operability that the data path demands. Given that (according to the repo, though not the breathless blog entry) its creators "do not recommend its use in production", Torus is -- in the famous words of Wolfgang Pauli -- not even wrong.

[1] http://dtrace.org/blogs/bmc/2008/11/10/fishworks-now-it-can-...

[2] http://dtrace.org/blogs/bmc/2013/06/25/manta-from-revelation...

[3] http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-ma...




I totally agree with you. I also liked how they said the motivation is to bring Google infrastructure to everyone else. How did Google do this? They basically imitated and improved on clustered filesystems developed in HPC. There were a lot of lessons for the emerging cloud market to learn from all the tooling done in HPC. Some of it was FOSS.

Whereas many companies seem to be doing the opposite in their work on these cloud filesystems. They don't build on proven, already-OSS components that have been battle-tested for a long time. They lack features and wisdom from prior deployments. They duplicate effort. They are also prone to using popular components, languages, whatever, with implicitly higher risk due to the fact that these were never built for fault-tolerant systems. Indeed, many of them assume something else will watch over them and help out in times of failure.

If it's high-assurance or fault-tolerance, my mantra is "tried and true beats novel and new." Just repurpose what's known to work while improving its capabilities and code quality. That knocks out the risks you know about, plus others you don't, since they were never documented. Has that been your experience, too? It should be a maxim in IT given how often this problem plays out.


While I strongly agree with bcantrill, I have to disagree with your HPC origins model in the most strenuous fashion.

The HPC world can be thought of as an alternative universe, one where dinosaurs came to some sort of lazy sentience instead of us mammals. HPC RAS (reliability, availability, serviceability) requirements and demands are profoundly different from, and profoundly lower than, what enterprise or even SMB (small & medium sized businesses) would find acceptable. There are, or were, interesting ideas in GPFS, but that was long ago, far away, and sterile besides - not that it stops IBM squeezing revenue from locked-in customers.

Some people in HPC do realize their insular state, but mostly there are hopes of doing things that the "real" world already did better earlier (e.g. "declustered RAID"). Their ongoing hype is for yet-another-meta-filesystem built with feet of straw upon a pile of manure.

The likes of Google infrastructure are built instead on academic storage R&D, which has been a great source of ideas and interesting test implementations - log-structured file systems, distributed hash tables, etc. There is some small overlap, for example in the work of Garth Gibson, but it's not a joint evolution at all.


HPC as an alternative universe is a fine metaphor. Yet I don't see how there's no comparison, given that HPC already had more RAS than I could get out of any Linux box for years. IBM SP2, SGI, and some Beowulf-style clusters had high-performance I/O, single-system image, distributed filesystems, and fault-tolerance where you wanted it. That sounds a lot like what the cloud market aimed for. On top of that, you had grid and metacomputing groups that spread it all out across multiple sites with ease of use for developers.

They developed in parallel and separately, but there are lots of similarities. For years, startups were taking HPC stuff and adapting it to meet cloud-style requirements for management, reliability, and performance. The claim that it wasn't up to business standards is countered by how many businesses bought those NUMA machines and clusters for mission-critical apps. A huge chunk of work in Windows and Linux in the '90s and early 2000s went into trying to match their metrics in order to replace them on the cheap with x86 servers and Ethernet. So, yes, HPC was making stuff to meet tough business needs that commercial servers and regular stacks couldn't compare to, despite number-crunching supercomputers being its main market.


Maybe things have fallen back drastically then or we're using different definitions of HPC (it's the likes of weather or bombs for me).

Supercomputer system MTBF numbers aren't high at all. Supercomputer storage is fault tolerant only by fortuitous accident and often requires week-long outages with destroyed data for software upgrades. These systems are built with the likes of dubious metafilesystems (I'm talking to you, Lustre) sitting on top of shaky file systems on top of junk-bin HW RAID systems almost guaranteed to corrupt your data over time.

I think your overall statement is valuable with the global search/replace of "enterprise" for "HPC" - airline reservations or credit card processing. That's all I'm saying - HPC people (at least currently) are massively overrated as developers and as admins. Maybe it's all just a plan to extract energy from Seymour Cray's pulsar-like rotating remains?


" Supercomputer storage is fault tolerant only by fortuitous accident and often requires week-long outages with destroyed data for software upgrades."

Have you tried Ceph or Sector/Sphere? Lustre is known to be crap, while Ceph gets a lot of praise and Sector/Sphere has potential for reliability with good performance. I think you may just be stuck with tools that suck. I'll admit it's been 10+ years since I was knowledgeable about this area. It could've all gone downhill since.

"That's all I'm saying - HPC people (at least currently) are massively overrated as developers and as admins. "

I'll agree with that. It's one of the reasons the field built so much tooling to make up for it. ;)


What does "potential for reliability" even mean? Even Lustre, which you malign and I have maligned even more, has potential for reliability if they just fix a few hundred egregious design flaws.


I haven't had a chance to run it. I also don't have clear data on its userbase. I just know it's been used in supercomputing centers for a while, doing large jobs fast over fast WANs. So: potentially reliable, once I see more users telling me it is in various situations.


Re Ceph vs Lustre: what's performance like? I've seen anecdotes quoting 3 GB/s over Infiniband for Lustre.

(I'm curious because I run an HPC installation with Lustre and NFS over XFS, and am trying to think about the future. MTBF doesn't matter as much as raw speed while it actually runs.)


At this point, this is really an apples-to-oranges comparison.

Lustre, as truly awful as it is, is a POSIX filesystem (or close enough for (literally) government work).

Redhat/Ceph announced only at the end of April that POSIX functionality was ready for production. Personally, that's not when I'd choose to deploy production storage. Ceph object and, nominally, block have much more time in production.

If you need POSIX, trusting Ceph at this point is an issue unless, as you say, MTBF isn't a concern. You might want to try BeeGFS, a similar logical model but much simpler to implement, performance up to a very high level, and a record of reliable HPC deployments (as oxymoronic as that sounds).

If you can do with object then certainly exorcise Lustre from your environment in favor of Ceph (or try Scality if non-OS isn't an issue). Lustre's only useful as a jobs program anyway - keeping people occupied who'd otherwise be bodging up real software.


A proprietary (and solid) alternative to Lustre would be GPFS, which also has a long track record in HPC (and other markets in which IBM thrives).

As someone who completely shares your Lustre sentiment, I can't fathom why Intel keeps pouring resources into it.


GPFS has an amazing number of features and, given a certain fiddliness of configuration and administration, is reliable and performant. It can even sit on top of block storage that it manages itself with advanced software RAID and volume management.

The problem (surprise!) is IBM. It's mature software, which means 21st Century Desperate IBM sees it as a cash cow - aggressively squeezing customers - and as something they can let their senior, expensive developers move on from - or lay them off in favor of "rightsourcing". You can certainly trust your data to it (unlike Lustre), but it'll be very expensive, especially on an ongoing basis, and the support team isn't going to know more than you by then. Also expect surprise visits from IBM licensing ninja squads looking for violations of the complex terms, which they will find.

As for Lustre, it brings to mind Oliver Wendell Holmes, Jr's, "Three generations of imbeciles are enough". I've been at least peripherally involved with it since 1999, with LLNL trying to strong-arm storage vendors into support. Someone should write a book following 16 years of the tangled Lustre trail from LLNL/CMU/CFS -> Sun -> Oracle -> WhamCloud -> OpenSFS -> ClusterStor -> Xyratex -> Seagate -> Intel (and probably ISIS too).

The answer to your question, IMHO, is that Intel just isn't that smart. They're basically a PR firm with a good fab in the basement. What do they know about storage, or so many other things? People don't remember when they tried to corner the web-serving market back during the first Internet boom. They fail a lot, but until now they've had enough of a cash torrent coming in that it didn't matter. They still do, of course, but there are inklings of an ebb.


Yeah, GPFS was one of the ones that inspired my HPC-and-cloud comparison. Combined with management software, it got one to about 80-90% of what was needed for cloud filesystems. It was badass back when I read about it being deployed in ASC Purple. I didn't know it had turned into stagnating, fascist crap under IBM. Sad outcome for such great technology.

Typical IBM, though. (shakes head)


Sounds like good recommendations based on what research I've done in these things. I forgot BeeGFS but it was in my bookmarks. Must be good in some way. ;)


Cheers!


Found these for Ceph, which indicate there are fast Infiniband deployments. I don't have more data, as I've been out of HPC for a while.

https://www.mellanox.com/related-docs/solutions/ppt_ceph_mel...

http://www.snia.org/sites/default/files/JohnKim_CephWithHigh...

Also, look at Sector/Sphere, which was made by the UDT designer for distributed, parallel supercomputing workloads. It has significant advantages over Hadoop. It's used with high-performance links to share data between supercomputing centers.

http://sector.sourceforge.net/index.html

http://sector.sourceforge.net/pub/Sector%20vs%20Hadoop%20-%2...


Native RDMA support for Ceph is still a ways off. The current implementation requires disabling Cephx authentication, which is a no-go in any environment where you can't completely trust every client (e.g. "cloud", where most current deployments/users live). It also hasn't seen much development since the initial proof-of-concept (still highly experimental).

That said, IPoIB should work just fine, and the main bottleneck currently is (Ethernet) latency. I'm running a couple of 1.6PB clusters (432 * 4TB) and can only get 20-60MB/s on a single client with a 4kB block size, but got bored of benchmarking after saturating 5 concurrent 10Gb clients with a 4MB block size.

I do expect the RDMA situation to improve substantially over the next year or so, even if authentication will still be unsupported. The latter generally isn't a problem in HPC where stuff like GPFS lives (where you also have to trust every client). And they clearly want that market now that CephFS is finally deemed production ready.


In the HPC crowd, I'm quite familiar with OrangeFS (aka PVFS2) which recently entered the standard kernel. I had a PVFS 2.7 cluster running for many years, 24/7 with decent reliability (it crashed a few times, but never lost data).

It works with RDMA, has a POSIX layer, and is roughly equivalent to Lustre in performance in my tests, but (1) it is very easy to set up (compared to Lustre) and (2) it has NFS actually working.


Yes, that's absolutely been my experience -- and even then, when it comes to the data path, you will likely find new failure modes in "tried and true" as you push it harder and longer and with the bar set at absolute perfection. I have learned this painful lesson twice: first, with Fishworks at Sun, when we turned ZFS into a storage appliance -- and we learned the painful difference between something that seems to work all of the time and something that actually works all of the time. (2009 was a really tough year.[1]) And ZFS was fundamentally sound (and certainly had been running for years in production) before we pushed it into the broad enterprise storage substrate: the bugs that we found weren't ones of durability, but rather of deeply pathological performance. (As I was fond of saying at the time, we never lost anyone's data -- but we had some data take some very, very long vacations.) I shudder to think about building a data path on much less proven components than ZFS circa 2008, let alone building a data path seemingly in total ignorance of the mechanics -- let alone the challenges -- of writing to persistent storage.

The second time I learned the painful lessons of storage was with Manta.[2] Here again, we built on ZFS and reliable, proven "tried and true" technologies like PostgreSQL and Zookeeper. And here again, we learned about really nasty, surprising failure modes at the margins.[3] These failure modes haven't led to data loss -- but when someone's data is unavailable, that is of little solace. In this regard, the data path -- the world of persistent state -- is a different world in terms of expectations for quality. That most of our domain thinks in terms of stateless apps is probably a good thing: state is a very hard thing to get right, and, in your words, tried and true absolutely beats novel and new. All of this is what makes Torus's ignorance of what comes before it so exasperating; one gets the sense that if they understood how thorny this problem actually is, they would be trying much harder to use the proven open source components out there rather than attempt to sloppily (if cheerfully) reinvent them.

[1] http://dtrace.org/blogs/bmc/2010/03/10/turning-the-corner/

[2] https://github.com/joyent/manta

[3] https://www.joyent.com/blog/manta-postmortem-7-27-2015


That was a pretty humble and good read. I don't think I'd have seen the autovacuuming issue coming. Actually, this quote is a perfect example of how subtle and ridiculous these issues can be:

"During the event, one of the shard databases had all queries on our primary table blocked by a three-way interaction between the data path queries that wanted shared locks, a "transaction wraparound" autovacuum that held a shared lock and ran for several hours, and an errant query that wanted an exclusive table lock."

That's with well-documented, well-debugged components doing the kinds of things they're expected to do. Still downed by a series of just three interactions creating a corner case -- three out of a probably ridiculous number of possible interactions over a long period of time. Any system that is reimplementing and re-debugging such components, on top of dealing with these interaction issues, will fare far worse. Hence both of our recommendations to avoid that risk.

Note: Amazon's TLA+ reports said their model checkers found bugs that didn't show up until 30+ steps into the protocols -- an unlikely set of steps that turned out to be likely in production, per their logs. Reading such things, I have no hope that code review or unit tests will save my ass or my stack if I try to clean-slate Google or Amazon infrastructure. Not even gonna try haha.


> Note: Amazon's TLA+ reports said their model checkers found bugs that didn't show up until 30+ steps into the protocols -- an unlikely set of steps that turned out to be likely in production, per their logs. Reading such things, I have no hope that code review or unit tests will save my ass or my stack if I try to clean-slate Google or Amazon infrastructure. Not even gonna try haha.

For those unfamiliar with the reference: there was an eye-opening report from Amazon engineers who'd used formal methods to find bugs in the design of S3 and other systems several years ago [0]. I highly recommend reading it and then watching as many of Leslie Lamport's talks on TLA+ and system specifications as possible.

[0] http://research.microsoft.com/en-us/um/people/lamport/tla/fo...


Anyone who's worked with Postgres at scale would guess autovacuum.

Postgres doesn't have many weaknesses, but most of them relate to autovacuum.


"Anyone who's worked with Postgres at scale would guess autovacuum."

Well, there's knowing it's autovacuum-related then there's the specific way it's causing a failure. First part was obvious. The rest took work.

"Postgres doesn't have many weaknesses, but most of them relate to autovacuum."

Sounds like that statement should be on a bug submission or something. They probably need to replace that with something better.


It's known to the postgres developers, and we are working on it. This specific issue (anti-wraparound vacuums being a lot more expensive) should be fixed in the upcoming 9.6.


Awesome! I already push Postgres and praise its team for the quality focus. Just extra evidence in your favor. :)


What triggered me was them just throwing out "Reed-Solomon" when talking about random writes. How does that work? We'll read from 5 places to complete your write?


My impression was that they heard reed solomon was used in robust systems like they are describing. They intend to use it in theirs. It will therefore be just as robust. Similar to how some firms describe their security after adding "256-bit, military-grade AES." ;)


They operate on blocks and can implement Reed-Solomon with no issues. Random writes do not matter with an architecture like this. The tricky part would be latency and performance during periods of growth.


Yeah, but they have to read several data/parity blocks, and then rewrite all parity blocks plus one data block, for any write to a given block.

This creates big difficulties for both consistency and performance, and fixes for consistency make the performance worse (and vice versa).

Google's filesystem could use Reed-Solomon because it's append-only, which makes consistency a non-issue, and performance can be fixed by buffering on the client side.
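
To make that concrete, here's a toy sketch with plain XOR parity (the simplest erasure code; Reed-Solomon has the same access pattern, just with Galois-field math instead of XOR). The layout and names are made up for illustration:

  package main

  import "fmt"

  func main() {
      // A stripe of 4 data blocks plus 1 XOR parity block.
      data := []byte{1, 2, 3, 4}
      var parity byte
      for _, b := range data {
          parity ^= b
      }

      // Overwriting data block 2 in place forces a read-modify-write:
      // read the old block and old parity, then write the new block
      // and the new parity. A crash between those two writes leaves
      // the stripe inconsistent -- the consistency problem above.
      newVal := byte(9)
      parity ^= data[2] ^ newVal // parity' = parity ^ old ^ new
      data[2] = newVal

      fmt.Println(data, parity) // parity still XORs to the data blocks
  }

With k data and m parity blocks, every one-block overwrite touches all m parity blocks the same way, which is why append-only layouts sidestep the whole issue.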


Torus is append-only too. We also plan to support something more like what Facebook's paper describes, where they have extra parity (xor) to support more efficient local repair.


How? I thought you're exporting a block device, not a filesystem? You can't append to a block device, and certainly every filesystem out there expects block devices to be random-writable, right?


The "interface" we're exporting is very different from the underlying storage. The block device interface we currently provide supports random writes just fine, but the underlying storage we use (which involves memory-mapped files) is append-only. Once written, blocks are only ever GC'd, not modified.


So if I were to run a database on this, with a lot of overwrites, the storage would grow infinitely?

Secondly, this implies you are remapping the LBA (offsets) all the time, perhaps taking what would be sequential access and turning it into random? That sounds pretty painful.


Nope, previous block versions get GC'd. I don't see how LBAs have any relevance here... you're talking about a much lower layer than what Torus is operating on.


You're providing a block device interface to the container. The container's FS is addressing LBAs. Sequential reads to the container's adjacent LBAs get turned into reads to whatever random Torus node is storing the data, based on when it was last written...


Exactly what you said. Torus is exposing block on top of what could be described as a log-structured FS. So while you may not deal with LBAs directly, there are LBAs involved. I took a look at the code, and you are putting an FS like ext4 on top of your block device. Any time an LBA is written to, you append to your store. This causes sequential access to become random, and in addition causes unneeded garbage collection issues.

Furthermore, it appears to me that etcd is now in the "data path". That is, in theory, each access could end up hitting etcd.

If so, I really would question why anyone would do this at all... this is not how any storage system is written.


The problem here is that you are trying to do block on top of a file system. This is a bigger problem than you can imagine, and while you may think LBAs are not involved, they actually are. You are naively taking on a well-known area in storage.


Ok, so that plus a little MVCC can make you consistent, but you've still got the read-many-to-write-one thing from the perspective of your block device interface, right? And block devices, if I'm remembering right, don't leave you any room to buffer pending writes.


Torus implements a kind of MVCC, yes. As for read-many-to-write-one, I assume you're talking about Reed-Solomon or similar erasure coding? There have been some papers written about ways to reduce that, a good one is from Facebook: https://code.facebook.com/posts/536638663113101/saving-capac.... And that's just one option. Also, this is all speculative since we have yet to implement erasure coding.
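
For the curious, the core of the local-repair idea from that paper, reduced to a toy XOR sketch (again, this is all speculative on our end -- nothing here is implemented):

  package main

  import "fmt"

  func xorAll(blocks []byte) byte {
      var p byte
      for _, b := range blocks {
          p ^= b
      }
      return p
  }

  func main() {
      // 10 data blocks split into two groups of 5, each with its own
      // local XOR parity (plus global parities, omitted here). Losing
      // one block now needs reads from its group of 5, not all 10.
      group := []byte{1, 2, 3, 4, 5}
      local := xorAll(group)

      // Block 2 is lost; rebuild it from the local group alone.
      rebuilt := local ^ group[0] ^ group[1] ^ group[3] ^ group[4]
      fmt.Println(rebuilt == group[2]) // true
  }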


Don't know if you'll see this, but:

If only one host at a time has access to a given virtual block device, there are some opportunities to buffer outgoing writes with a write-through cache. That might be the way to go if you explore erasure coding in the future.
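
As a sketch of what I mean (hypothetical, nothing Torus-specific): with a single writer you can accumulate a full stripe before computing parity, so no write ever needs the read-modify-write dance:

  package main

  import "fmt"

  // Buffer writes until a full stripe accumulates, then compute parity
  // once over fresh data -- no extra reads needed. Only safe if exactly
  // one host owns the volume, as described above.
  type stripeBuffer struct {
      width   int
      pending [][]byte
  }

  func (b *stripeBuffer) Write(block []byte) {
      b.pending = append(b.pending, block)
      if len(b.pending) == b.width {
          b.flush()
      }
  }

  func (b *stripeBuffer) flush() {
      parity := make([]byte, len(b.pending[0]))
      for _, blk := range b.pending {
          for i := range blk {
              parity[i] ^= blk[i] // full-stripe parity, no old data read
          }
      }
      fmt.Printf("flushing %d blocks + parity %v\n", len(b.pending), parity)
      b.pending = b.pending[:0]
  }

  func main() {
      b := &stripeBuffer{width: 4}
      for i := 0; i < 8; i++ {
          b.Write([]byte{byte(i)})
      }
  }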


Well, good luck with it. Don't get me wrong, I'd love 1.5x redundancy overhead instead of 3x. But even if you have to downgrade to offering either replication or XOR, it's still a huge missing piece of the typical container deployment, so good luck.


Agree 100% on this, Bryan. Their claims about how existing storage solutions are a poor fit for this use case are completely false. As too often happens, they address only high-margin commercial systems and ignore the fact that open-source solutions are out there as well. (Disclaimer: I'm a Gluster developer). Then they leave both files and objects as "exercises for the reader" which shows a total lack of understanding for the problem or respect for people already working on it. Their announcement is clearly more about staking a claim to the "storage for containers" space, before the serious players get there, than it is about realistic expectations for how the project might grow. Particularly galling are their comments about the difficulty of building a community for something like this, when such communities already exist and their announcement actually harms those communities.

Until now, I've thought quite highly of the CoreOS team. Now, not so much. They're playing a "freeze the competition" game instead of trying to compete on merit.


How is them developing their own technology not competing on merit? Did they steal the technology? Did they claim that anyone ought to use this in production? Why all the negativity? Yeah never mind let's just discourage everyone from trying to build new technology. Nobody is forcing you to use this. If it's not for you, move on. Going out on a rant about what you think their intentions are is ridiculous.

Really you come off as defensive. I don't understand how CoreOS has disrespected anyone by coming up with their own approach to the problem.

They're not writing blog posts or comments on HN, they're writing code.


> They're not writing blog posts or comments on HN, they're writing code.

Actually, the problem is that 90% of the code still remains to be written, while others (including me) have already done so. They've addressed only the very simplest part of the problem, not even far enough to show any performance comparisons, in a manner strongly reminiscent of Sheepdog (belying your "own approach" claim). That's a poor basis from which to promise so much. It's like writing an interpreter for a simple programming language and claiming it'll be a full optimizing C++ compiler soon. Just a few little pieces remaining, right?

It's perfectly fine for them to start their own project and have high hopes for it. The more the merrier. However, I have little patience for people who blur the lines between what's there and what might hypothetically exist some time in the future. That's far too often used to stifle real innovation that's occurring elsewhere. Maybe it's more common in storage than whatever your specialty is, but it's a well known part of the playbook. It's important to be crystal clear about what's real vs. what's seriously thought out vs. what's total blue-sky. Users and fellow developers deserve nothing less.


Actually running just:

    torusctl init
    torusctl -C $ETCD_IP:2379 init
    ./torusd --etcd 127.0.0.1:2379 --peer-address http://127.0.0.1:40000 --data-dir /tmp/torus1 --size 20GiB

to get a near-production-ready system is way easier than setting up GlusterFS (even as a demo) or Ceph. A distributed system doesn't need to be complicated.


Actually that's almost exactly the same steps as for GlusterFS.

  > gluster peer probe ...
  > gluster volume create ...
  > gluster volume start ...
But that's not even the point. You're right that the interface to a distributed storage system doesn't need to be complicated, but the implementation inevitably must be to handle the myriad error conditions that will be thrown at it. Correctness is even more important for storage than for other areas in computing, and something that only implements the "happy path" for the simplest data model or semantics is barely even a beginning. The distance between "seems to work" and "can be counted on to work" is far greater for this type of system than for most others. I think it's important to understand and communicate that, so that people don't develop unrealistic expectations. That way lies nothing but heartbreak, not least for the developers themselves. It's far better for everyone to set and meet modest goals than to make extravagant promises that can't be kept.


Also, they are writing comments on HN. As are you. What exactly is your point?


If I could also add: NBD is extremely notorious in Linux. I am a block storage developer, and NBD has been one of the main reasons why so many OpenStack storage products (like Formation Data's and others) have really struggled. There are many known Linux kernel issues with NBD; for example, if an NBD provider (the user-space daemon) exits for any reason, the kernel panics. Here is an example of a long-outstanding bug that has plagued the NBD and OpenStack communities for years (https://www.mail-archive.com/nbd-general@lists.sourceforge.n...). It won't get addressed any time soon. I also looked at the repo, and it looks like a very simplistic approach to a hard problem. BTW, I have used etcd, and if this is anything like etcd, I'd really be worried. etcd snapshots bring an entire cluster down for a while.


Here is an example script that will panic anyone using NBD to serve a block device (replace qemu-nbd with the NBD export provided by torus):

qemu-img create -f qcow2 f.img 1G mkfs.ext4 f.img modprobe nbd || true qemu-nbd -c /dev/nbd0 f.img mount /dev/nbd0 k killall -KILL qemu-nbd sleep 1 ls k


Above script with correct formatting (plus the missing mount point, and mkfs pointed at the exported device rather than the image file):

  qemu-img create -f qcow2 f.img 1G   # backing image for the export
  modprobe nbd || true
  qemu-nbd -c /dev/nbd0 f.img         # qemu-nbd is the userspace NBD server
  mkfs.ext4 /dev/nbd0                 # filesystem goes on the exported device
  mkdir -p k
  mount /dev/nbd0 k
  killall -KILL qemu-nbd              # the server dies out from under the kernel
  sleep 1
  ls k                                # any access now trips the kernel BUG()


But: that's not a kernel panic, just a BUG() stack trace from the kernel, which doesn't halt the system.


Looking at DTrace, Fishworks, ZFS and Solaris (IMO the best OS, technologically speaking), it's interesting to see how strong and innovative its engineering team was while the business was going down. How did that happen? How can the engineering team be that productive and functional while the business vision was so lacking?


Even when commercially misguided, Sun always had terrific engineering talent -- and my farewell to the company captures some of that.[1] In terms of why did the company fail, the short answer is probably that SPARC was disrupted by x86, and by the time the company figured that out, it was too late to recover.[2]

[1] http://dtrace.org/blogs/bmc/2010/07/25/good-bye-sun/

[2] Longer answer: https://news.ycombinator.com/item?id=2287033


I've always been very impressed by Sun's talent. And thank you for dtrace! :) If only the Linux community / leaders would be less arrogant and adopt technologies from Solaris/BSD that are an order of magnitude better (kqueue, netgraph and more) instead of coming up with new ways to screw up.

Sun should be resurrected now that RISC is leading the way, and build everything on top of ARM! :) If only.


Linux can't be blamed for the decision to place Sun's technologies under the CDDL.


There's no license on kqueue's interface or netgraph, and many other great interfaces. But they decided to create square wheels rather than learn from the mistakes others made before them.


You're right.

I was thinking mostly of the eternal buzzkill that ZFS is only usable through indirect means.


Just how many things is CoreOS trying to do? Last I counted, they want to

a) Build a distributed OS

b) Build a distributed scheduler (Fleet)

c) Build a distributed key value system (etcd)

d) Build a new container engine (Rocket)

e) Build a network fabric (Flannel)

f) Now embark on building a brand new distributed storage system.

Holy cow that's some goal list. Sounds like something my kids would make up for their Christmas wish list.

I really don't get how their board and investors let them get away with such a childish imagination.

Each one of those is a company effort on its own.


Fleet died once Kubernetes became a thing, even though they serve entirely different use cases.


Actually, the biggest problem is full POSIX -- or rather, mutable storage. If there is just one way in and never a way out, it's actually much easier. So in practice, if you don't run your database on shared storage and instead manage it "the old way", you could get something reliable out of 'never delete'.

However, CAS (content-addressable storage) and immutable file systems aren't that common.
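
The CAS idea in miniature, as a sketch (assuming SHA-256 as the address; any collision-resistant hash works):

  package main

  import (
      "crypto/sha256"
      "fmt"
  )

  // Content-addressable storage: the key is the hash of the value, so
  // nothing is ever overwritten -- one way in, never a way out.
  // "Deletion" becomes garbage collection of unreferenced keys.
  type cas map[[32]byte][]byte

  func (c cas) Put(data []byte) [32]byte {
      k := sha256.Sum256(data)
      c[k] = data // idempotent: identical content gets the same key
      return k
  }

  func main() {
      store := cas{}
      k := store.Put([]byte("immutable block"))
      fmt.Printf("%x -> %s\n", k[:4], store[k])
  }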


Their one saving grace might be that NBD is such a simple protocol. By not trying to make it a full-fledged FS, they might actually have hit on a tractable problem.


> unaware of how difficult it is to even just reliably get bits to and from stable storage

It's not difficult, it's actually impossible. And any system relying on the reliability of that would be broken by design.

> let alone string that into a distributed system that must make CAP tradeoffs

No. Distributed systems and "CAP tradeoffs" actually exist to solve the problem above with predictable certainty.


It seems you're disagreeing with yourself.


Care to point out where? I can explain if something seems too ambiguous.

It is impossible to reliably store something in your local storage. It is possible to achieve some probability of data retention and availability in a distributed system.
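
Back-of-envelope, assuming independent replica failures (an assumption real deployments violate via correlated failures -- which is exactly where it gets hard):

  package main

  import "fmt"

  // If each replica independently fails within some window with
  // probability p, data is lost only if ALL n replicas fail.
  func lossProb(p float64, n int) float64 {
      loss := 1.0
      for i := 0; i < n; i++ {
          loss *= p
      }
      return loss
  }

  func main() {
      fmt.Println(lossProb(0.01, 1)) // 0.01   -- a single copy
      fmt.Println(lossProb(0.01, 3)) // ~1e-06 -- three replicas
  }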


Just as a separate question: why are you so bitter about btrfs? ;)



