Take it from someone who has been involved in both highly durable local filesystems and highly available object storage systems: this is such a hard, nasty problem with so many dark, hidden and dire failure modes, that it takes years of running in production to get these systems to the level of reliability and operability that the data path demands. Given that (according to the repo, though not the breathless blog entry) its creators "do not recommend its use in production", Torus is -- in the famous words of Wolfgang Pauli -- not even wrong.
Whereas, many companies seem to be doing the opposite in their work on these cloud filesystems. They don't build on proven, already-OSS components that have been battle-tested for a long time. They lack features and wisdom from prior deployments. They duplicate effort. They are also prone to using popular components, languages, whatever with implicitly-higher risk due to fact they were never built for fault-tolerant systems. Actually, many of them often assume something will watch and help them out in time of failure.
If it's high-assurance or fault-tolerance, my manta is "tried and true beats novel and new." Just repurpose what's known to work while improving its capabilities and code quality. Knocks out risks you know about plus others you don't since they were never documents. Has that been your experience, too? Should be a maxim in IT given how often problem plays out.
The HPC world can be thought of as an alternative universe, one where dinosaurs came to some sort of lazy sentience instead us mammals. HPC RAS (reliability, availability, serviceability) requirements and demands are profoundly different and profoundly lower from what enterprise or even SMB (small & medium sized businesses) would find acceptable. There are or were interesting ideas in GPFS but that was long ago, far away, and sterile besides - not that it stops IBM squeezing revenue from locked-in customers.
Some people in HPC do realize their insular state, but mostly there are hopes for things to be done better that were done better earlier (e.g. "declustered RAID") in the "real" world. Their ongoing hype is for yet-another-meta-filesystem built with feet of straw upon a pile of manure.
The likes of Google infrastructure, are built instead on academic storage R&D, which has been a great source of ideas and interesting test implementations - log structured file systems, distributed hash tables, etc. There is some small overlap, for example in the work of Garth Gibson, but it's not a joint evolution at all.
They developed in parallel and separately but there's lots of similarities. Many startups in HPC were taking HPC stuff to meet cloud-style requirements for years in terms of management, reliability, and performance. That it wasn't up to the standards of businesses is countered by how many bought those NUMA machines and clusters for mission-critical apps. A huge chunk of work in Windows and Linux in 90's and early 2000's went to trying to match their metrics in order to replace them on the cheap with x86 servers and Ethernet. So, yes, HPC was making stuff to meet tough, business needs that commercial servers and regular stacks couldn't compare to. That's despite number-crunching supercomputers being their main market.
Supercomputer system MTBF numbers aren't high at all. Supercomputer storage is fault tolerant only by fortuitous accident and often requires week-long outages with destroyed data for software upgrades. These systems are built with the likes of dubious metafilesystems (I'm talking to you, Lustre) sitting on top of shaky file systems on top of junk-bin HW RAID systems almost guaranteed to corrupt your data over time.
I think your overall statement is valuable with the global search/replace of "enterprise" for "HPC" - airline reservations or credit card processing. That's all I'm saying - HPC people (at least currently) are massively overrated as developers and as admins. Maybe it's all just a plan to extract energy from Seymour Cray's pulsar-like rotating remains?
Have you tried Ceph or Sector/Sphere? Lustre is known to be crap while Ceph gets a lot of praise and Sector/Sphere has potential for reliability with good performance. I think you may just be stuck with tools that suck. I'll admit it was about 10+ years ago when I was knowledgeable of this area. It could've all gone downhill since.
"That's all I'm saying - HPC people (at least currently) are massively overrated as developers and as admins. "
I'll agree with that. It's one of reasons field built so much tooling to make up for it. ;)
(I'm curious because I run an HPC installation with Lustre and NFS over XFS, and trying to think of the future. MBTF doesn't matter as much as raw speed while it actually runs.)
Lustre, as truly awful as it is, is a POSIX filesystem (or close enough for (literally) government work).
Redhat/Ceph only at the end of April, announced that POSIX functionality was ready for production. Personally, that's not when I'd choose to deploy production storage. Ceph object and nominally block have much more time in production.
If you need POSIX, trusting Ceph at this point is an issue unless, as you say, MTBF isn't a concern. You might want to try BeeGFS, a similar logical model but much simpler to implement, performance up to a very high level, and a record of reliable HPC deployments (as oxymoronic as that sounds).
If you can do with object then certainly exorcise Lustre from your environment in favor of Ceph (or try Scality if non-OS isn't an issue). Lustre's only useful as a jobs program anyway - keeping people occupied who'd otherwise be bodging up real software.
As someone who completely shares your Lustre sentiment, I can't fathom why Intel keeps pouring resources into it.
The problem (surprise!) is IBM. It's mature software, which means 21st Century Desperate IBM sees it as a cash cow - aggressively squeezing customers - and as something they can let their senior, expensive developers move on from - or lay them off in favor of "rightsourcing". You can certainly trust your data to it (unlike Lustre), but it'll be very expensive, especially on an ongoing basis, and the support team isn't going to know more than you by then. Also expect surprise visits from IBM licensing ninja squads looking for violations of the complex terms, which they will find.
As for Lustre, it brings to mind Oliver Wendell Holmes, Jr's, "Three generations of imbeciles are enough". I've been at least peripherally involved with it since 1999, with LLNL trying to strong-arm storage vendors into support. Someone should write a book following 16 years of the tangled Lustre trail from LLNL/CMU/CFS -> Sun -> Oracle -> WhamCloud -> OpenSFS -> ClusterStor -> Xyratex -> Seagate -> Intel (and probably ISIS too).
The answer to your question IHMO, is that Intel just isn't that smart. They're basically a PR firm with a good fab in the basement. What do they know about storage or so many other things? People don't remember when they tried to corner the web serving market back during the 1st Internet boom. They fail a lot, but until now had enough of a cash torrent coming that it didn't matter. They still do, of course, but there are inklings of an ebb.
Typical IBM, though. (shakes head)
Also, look at Sector/Sphere which was made by UDT designer for distributed, parallel workloads for supercomputing. It has significant advantages over Hadoop. It's used with high-performance links to share data between supercomputing centers.
That said, IPoIB should work just fine, and the main bottleneck currently is (Ethernet) latency. I'm running a couple of 1,6PB clusters (432 * 4TB) and can only get 20-60MBps on a single client with a 4kB block size, but got bored of benchmarking after saturating 5 concurrent 10Gb clients with a 4MB block size.
I do expect the RDMA situation to improve substantially over the next year or so, even if authentication will still be unsupported. The latter generally isn't a problem in HPC where stuff like GPFS lives (where you also have to trust every client). And they clearly want that market now that CephFS is finally deemed production ready.
It works with RDMA, has a POSIX layer, and is roughly equivalent to Lustre in performance in my tests, but 1° is very easy to setup (compared to Lustre) 2° has NFS actually working.
The second time I learned the painful lessons of storage was with Manta. Here again, we built on ZFS and reliable, proven "tried and true" technologies like PostgreSQL and Zookeeper. And here again, we learned about really nasty, surprising failure modes at the margins. These failure modes haven't led to data loss -- but when someone's data is unavailable, that is of little solace. In this regard, the data path -- the world of persistent state -- is a different world in terms of expectations for quality. That most of our domain thinks in terms of stateless apps is probably a good thing: state is a very hard thing to get right, and, in your words, tried and true absolutely beats novel and new. All of this is what makes Torus's ignorance of what comes before it so exasperating; one gets the sense that if they understood how thorny this problem actually is, they would be trying much harder to use the proven open source components out there rather than attempt to sloppily (if cheerfully) reinvent them.
"During the event, one of the shard databases had all queries on our primary table blocked by a three-way interaction between the data path queries that wanted shared locks, a "transaction wraparound" autovacuum that held a shared lock and ran for several hours, and an errant query that wanted an exclusive table lock."
That's with well-documented, well-debugged components doing the kinds of things they're expected to do. Still downed by a series of just three interactions creating a corner case. Three out of a probably ridiculous number over a large amount of time. Any system redoing and debugging components plus dealing with these interaction issues will fare far worse. Hence, both of our recommendations to avoid that risk.
Note: Amazon's TLA+ reports said they model-checkers for finding bugs that didn't show up until 30+ steps in the protocols. An unlikely set of steps that actually was likely in production per logs. Reading such things, I have no hope that code review or unit tests will save my ass or my stack if I try to clean-slate Google or Amazon infrastructure. Not even gonna try haha.
For those unfamiliar with the reference, there was an eye-opening report from Amazon engineers who'd used formal methods to find bugs in the design of S3 and other systems several years ago . I highly recommend reading it and then watching as many of Leslie Lamports talks on TLA+ and system specifications as possible.
Postgres doesn't have many weaknesses, but most of them relate to autovacuum.
Well, there's knowing it's autovacuum-related then there's the specific way it's causing a failure. First part was obvious. The rest took work.
"Postgres doesn't have many weaknesses, but most of them relate to autovacuum."
Sounds like that statement should be on a bug submission or something. They probably need to replace that with something better.
This creates big difficulties for both consistency and performance, and fixes for consistency make the performance worse (and vice versa).
Google's filesystem could use reed-solomon because they're append-only, making consistency a non-issue and performance can be fixed by buffering on the client side.
Secondly, this implies you are remapping the LBA (offsets) all the time, perhaps taking what would be sequential access and turning it into random? That sounds pretty painful.
Further more, it appears to me that etcd is now in the "data path" That is, in theory, each access could end up hitting etcd.
If so, I really would question why anyone would do this at all... this is not how any storage system is written.
If only one host at a time has access to a given virtual block device, there are some opportunities to buffer outgoing writes with a write-through cache. That might be the way to go if you explore erasure coding in the future.
Until now, I've thought quite highly of the CoreOS team. Now, not so much. They're playing a "freeze the competition" game instead of trying to compete on merit.
Really you come off as defensive. I don't understand how CoreOS has disrespected anyone by coming up with their own approach to the problem.
They're not writing blog posts or comments on HN, they're writing code.
Actually, the problem is that 90% of the code still remains to be written, while others (including me) have already done so. They've addressed only the very simplest part of the problem, not even far enough to show any performance comparisons, in a manner strongly reminiscent of Sheepdog (belying your "own approach" claim). That's a poor basis from which to promise so much. It's like writing an interpreter for a simple programming language and claiming it'll be a full optimizing C++ compiler soon. Just a few little pieces remaining, right?
It's perfectly fine for them to start their own project and have high hopes for it. The more the merrier. However, I have little patience for people who blur the lines between what's there and what might hypothetically exist some time in the future. That's far too often used to stifle real innovation that's occurring elsewhere. Maybe it's more common in storage than whatever your specialty is, but it's a well known part of the playbook. It's important to be crystal clear about what's real vs. what's seriously thought out vs. what's total blue-sky. Users and fellow developers deserve nothing less.
torusctl -C $ETCD_IP:2379 init
./torusd --etcd 127.0.0.1:2379 --peer-address http://127.0.0.1:40000 --data-dir /tmp/torus1 --size 20GiB
> gluster peer probe ...
> gluster volume create ...
> gluster volume start ...
qemu-img create -f qcow2 f.img 1G
modprobe nbd || true
qemu-nbd -c /dev/nbd0 f.img
mount /dev/nbd0 k
killall -KILL qemu-nbd
qemu-img create -f qcow2 f.img 1G
modprobe nbd || true
qemu-nbd -c /dev/nbd0 f.img
mount /dev/nbd0 k
killall -KILL qemu-nbd
 Longer answer: https://news.ycombinator.com/item?id=2287033
Sun should be resurrected now given that risc is leading the way and build everything on top of ARM! :). if only.
I was thinking mostly of the eternal buzzkill that ZFS is only usable through indirect means.
a) Build a distributed OS
b) Build a distributed scheduler (Fleet)
c) Build a distributed key value system (etcd)
d) Build a new container engine (Rocket)
e) Build a network fabric (Flannel)
f) Now embark on building a brand new distributed storage system.
Holy cow that's some goal list. Sounds like something my kids would make up for their Christmas wish list.
I really don't get how their board and investors let them get away with such a childish imagination.
Each one of those is a company effort on its own.
However CAS and immutable file systems aren't that common.
It's not difficult, it's actually impossible. And any system relying on the reliability of that would be broken by design.
> let alone string that into a distributed system that must make CAP tradeoffs
No. Distributed systems and "CAP tradeoffs" actually exist to solve the problem above with predictable certainty.
It is impossible to reliably store something in your local storage. It is possible to achieve some probability of data retention and availability in a distributed system.