Hacker News new | past | comments | ask | show | jobs | submit login
Torus: A distributed storage system by CoreOS (coreos.com)
267 points by philips on June 1, 2016 | hide | past | favorite | 183 comments

It's hard to take this seriously: storage is an excruciatingly hard problem, yet this cheerful description of a nascent and aspirational effort seems blissfully unaware of how difficult it is to even just reliably get bits to and from stable storage, let alone string that into a distributed system that must make CAP tradeoffs. There is not so much of a whisper as to what the data path actually looks like other than "the design includes the ability to support [...] Reed-Solomon error correction in the near future" -- and the fact that such an empty system hails itself as pioneering an unsolved problem in storage is galling in its ignorance of prior work (much of it open source).

Take it from someone who has been involved in both highly durable local filesystems[1] and highly available object storage systems[2][3]: this is such a hard, nasty problem with so many dark, hidden and dire failure modes, that it takes years of running in production to get these systems to the level of reliability and operability that the data path demands. Given that (according to the repo, though not the breathless blog entry) its creators "do not recommend its use in production", Torus is -- in the famous words of Wolfgang Pauli -- not even wrong.

[1] http://dtrace.org/blogs/bmc/2008/11/10/fishworks-now-it-can-...

[2] http://dtrace.org/blogs/bmc/2013/06/25/manta-from-revelation...

[3] http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-ma...

I totally agree with you. I also liked how they said the motivation is to make Google infrastructure for everyone else. How did Google do this? They basically imitated and improved on clustered filesystems developed in HPC. There were a lot of lessons to learn for emerging cloud market in all tooling done in HPC. Some was FOSS.

Whereas, many companies seem to be doing the opposite in their work on these cloud filesystems. They don't build on proven, already-OSS components that have been battle-tested for a long time. They lack features and wisdom from prior deployments. They duplicate effort. They are also prone to using popular components, languages, whatever with implicitly-higher risk due to fact they were never built for fault-tolerant systems. Actually, many of them often assume something will watch and help them out in time of failure.

If it's high-assurance or fault-tolerance, my manta is "tried and true beats novel and new." Just repurpose what's known to work while improving its capabilities and code quality. Knocks out risks you know about plus others you don't since they were never documents. Has that been your experience, too? Should be a maxim in IT given how often problem plays out.

While I strongly agree with bcantrill, I have to disagree with your HPC origins model in the most strenuous fashion.

The HPC world can be thought of as an alternative universe, one where dinosaurs came to some sort of lazy sentience instead us mammals. HPC RAS (reliability, availability, serviceability) requirements and demands are profoundly different and profoundly lower from what enterprise or even SMB (small & medium sized businesses) would find acceptable. There are or were interesting ideas in GPFS but that was long ago, far away, and sterile besides - not that it stops IBM squeezing revenue from locked-in customers.

Some people in HPC do realize their insular state, but mostly there are hopes for things to be done better that were done better earlier (e.g. "declustered RAID") in the "real" world. Their ongoing hype is for yet-another-meta-filesystem built with feet of straw upon a pile of manure.

The likes of Google infrastructure, are built instead on academic storage R&D, which has been a great source of ideas and interesting test implementations - log structured file systems, distributed hash tables, etc. There is some small overlap, for example in the work of Garth Gibson, but it's not a joint evolution at all.

HPC as an alternative universe is a fine metaphor. Yet, I don't see how there's no comparison given HPC already had more RAS than I could get out of any Linux box for years. IBM SP2, SGI, and some Beowulf-style clusters had high-performance I/O, single-system image, distributed filesystems, and fault-tolerance where you wanted. That sounds a lot like what cloud market aimed for. On top of it, you had grid and metacomputing groups that spread it out across multiple sites with ease of use for developers.

They developed in parallel and separately but there's lots of similarities. Many startups in HPC were taking HPC stuff to meet cloud-style requirements for years in terms of management, reliability, and performance. That it wasn't up to the standards of businesses is countered by how many bought those NUMA machines and clusters for mission-critical apps. A huge chunk of work in Windows and Linux in 90's and early 2000's went to trying to match their metrics in order to replace them on the cheap with x86 servers and Ethernet. So, yes, HPC was making stuff to meet tough, business needs that commercial servers and regular stacks couldn't compare to. That's despite number-crunching supercomputers being their main market.

Maybe things have fallen back drastically then or we're using different definitions of HPC (it's the likes of weather or bombs for me).

Supercomputer system MTBF numbers aren't high at all. Supercomputer storage is fault tolerant only by fortuitous accident and often requires week-long outages with destroyed data for software upgrades. These systems are built with the likes of dubious metafilesystems (I'm talking to you, Lustre) sitting on top of shaky file systems on top of junk-bin HW RAID systems almost guaranteed to corrupt your data over time.

I think your overall statement is valuable with the global search/replace of "enterprise" for "HPC" - airline reservations or credit card processing. That's all I'm saying - HPC people (at least currently) are massively overrated as developers and as admins. Maybe it's all just a plan to extract energy from Seymour Cray's pulsar-like rotating remains?

" Supercomputer storage is fault tolerant only by fortuitous accident and often requires week-long outages with destroyed data for software upgrades."

Have you tried Ceph or Sector/Sphere? Lustre is known to be crap while Ceph gets a lot of praise and Sector/Sphere has potential for reliability with good performance. I think you may just be stuck with tools that suck. I'll admit it was about 10+ years ago when I was knowledgeable of this area. It could've all gone downhill since.

"That's all I'm saying - HPC people (at least currently) are massively overrated as developers and as admins. "

I'll agree with that. It's one of reasons field built so much tooling to make up for it. ;)

What does "potential for reliability" even mean? Even Lustre, which you malign and I have maligned even more, has potential for reliability if they just fix a few hundred egregious design flaws.

I haven't had a chance to run it. I also don't have clear data on it's userbase. I just know it's been used in supercomputing centers for a while doing large jobs fast over fast WAN's. So, potentially reliable if I see more users telling me it is in various situations.

Re CEPH vs Lustre: what's performance like? I've seen anecdotes quoting 3 GBps over Infiniband for Lustre.

(I'm curious because I run an HPC installation with Lustre and NFS over XFS, and trying to think of the future. MBTF doesn't matter as much as raw speed while it actually runs.)

At this point, this is really an apples-to-oranges comparison.

Lustre, as truly awful as it is, is a POSIX filesystem (or close enough for (literally) government work).

Redhat/Ceph only at the end of April, announced that POSIX functionality was ready for production. Personally, that's not when I'd choose to deploy production storage. Ceph object and nominally block have much more time in production.

If you need POSIX, trusting Ceph at this point is an issue unless, as you say, MTBF isn't a concern. You might want to try BeeGFS, a similar logical model but much simpler to implement, performance up to a very high level, and a record of reliable HPC deployments (as oxymoronic as that sounds).

If you can do with object then certainly exorcise Lustre from your environment in favor of Ceph (or try Scality if non-OS isn't an issue). Lustre's only useful as a jobs program anyway - keeping people occupied who'd otherwise be bodging up real software.

A proprietary (and solid) alternative to Lustre would be GPFS, which also has a long track record in HPC (and other markets in which IBM thrives).

As someone who completely shares your Lustre sentiment, I can't fathom why Intel keeps pouring resources into it.

GPFS has an amazing number of features, offers high performance and, given a certain fiddliness of configuration and administration, is reliable and performant. It can even sit on top of block storage that itself manages with advanced software RAID and volume management.

The problem (surprise!) is IBM. It's mature software, which means 21st Century Desperate IBM sees it as a cash cow - aggressively squeezing customers - and as something they can let their senior, expensive developers move on from - or lay them off in favor of "rightsourcing". You can certainly trust your data to it (unlike Lustre), but it'll be very expensive, especially on an ongoing basis, and the support team isn't going to know more than you by then. Also expect surprise visits from IBM licensing ninja squads looking for violations of the complex terms, which they will find.

As for Lustre, it brings to mind Oliver Wendell Holmes, Jr's, "Three generations of imbeciles are enough". I've been at least peripherally involved with it since 1999, with LLNL trying to strong-arm storage vendors into support. Someone should write a book following 16 years of the tangled Lustre trail from LLNL/CMU/CFS -> Sun -> Oracle -> WhamCloud -> OpenSFS -> ClusterStor -> Xyratex -> Seagate -> Intel (and probably ISIS too).

The answer to your question IHMO, is that Intel just isn't that smart. They're basically a PR firm with a good fab in the basement. What do they know about storage or so many other things? People don't remember when they tried to corner the web serving market back during the 1st Internet boom. They fail a lot, but until now had enough of a cash torrent coming that it didn't matter. They still do, of course, but there are inklings of an ebb.

Yeah, yeah, GPFS was one of them that inspired my HPC and cloud comparison. It, combined with management software, got one to about 80-90% of what they needed for cloud filesystems. It was badass back when I read about it being deployed in ASC Purple. I didn't know it turned into some stagnating, fascist crap with IBM. Sad outcome for such great technology.

Typical IBM, though. (shakes head)

Sounds like good recommendations based on what research I've done in these things. I forgot BeeGFS but it was in my bookmarks. Must be good in some way. ;)


Found these for CEPH that indicates there's fast deployments for Infiniband. Don't have more data as I've been out of HPC a while.



Also, look at Sector/Sphere which was made by UDT designer for distributed, parallel workloads for supercomputing. It has significant advantages over Hadoop. It's used with high-performance links to share data between supercomputing centers.



Native RDMA support for Ceph is still a ways off. The current implementation requires disabling Cephx authentication, which is a no-go in any environment where you can't completely trust every client (e.g. "cloud", where most current deployments/users live). It also hasn't seen much development since the initial proof-of-concept (still highly experimental).

That said, IPoIB should work just fine, and the main bottleneck currently is (Ethernet) latency. I'm running a couple of 1,6PB clusters (432 * 4TB) and can only get 20-60MBps on a single client with a 4kB block size, but got bored of benchmarking after saturating 5 concurrent 10Gb clients with a 4MB block size.

I do expect the RDMA situation to improve substantially over the next year or so, even if authentication will still be unsupported. The latter generally isn't a problem in HPC where stuff like GPFS lives (where you also have to trust every client). And they clearly want that market now that CephFS is finally deemed production ready.

In the HPC crowd, I'm quite familiar with OrangeFS (aka PVFS2) which recently entered the standard kernel. I had a PVFS 2.7 cluster running for many years, 24/7 with decent reliability (it crashed a few times, but never lost data).

It works with RDMA, has a POSIX layer, and is roughly equivalent to Lustre in performance in my tests, but 1° is very easy to setup (compared to Lustre) 2° has NFS actually working.

Yes, that's absolutely been my experience -- and even then, when it comes to the data path, you will likely find new failure modes in "tried and true" as you push it harder and longer and with the bar being set at absolute perfection. I have learned this painful lesson twice: first, with Fishworks at Sun when we turned ZFS into a storage appliance -- and we learned the painful difference between something that seems to work all of the time and something that actually works all of the time. (2009 was a really tough year.[1]) And ZFS was fundamentally sound (and certainly had been running for years in production) before we pushed it into the broad enterprise storage substrate: the bugs that we found weren't ones of durability, but rather of deeply pathological performance. (As I was fond of saying at the time, we never lost anyone's data -- but we has some data take some very, very long vacations.) I shudder to think about building a data path on much less proven components than ZFS circa 2008, let alone building a data path seemingly in total ignorance of the mechanics -- let alone challenges -- of writing to persistent storage.

The second time I learned the painful lessons of storage was with Manta.[2] Here again, we built on ZFS and reliable, proven "tried and true" technologies like PostgreSQL and Zookeeper. And here again, we learned about really nasty, surprising failure modes at the margins.[3] These failure modes haven't led to data loss -- but when someone's data is unavailable, that is of little solace. In this regard, the data path -- the world of persistent state -- is a different world in terms of expectations for quality. That most of our domain thinks in terms of stateless apps is probably a good thing: state is a very hard thing to get right, and, in your words, tried and true absolutely beats novel and new. All of this is what makes Torus's ignorance of what comes before it so exasperating; one gets the sense that if they understood how thorny this problem actually is, they would be trying much harder to use the proven open source components out there rather than attempt to sloppily (if cheerfully) reinvent them.

[1] http://dtrace.org/blogs/bmc/2010/03/10/turning-the-corner/

[2] https://github.com/joyent/manta

[3] https://www.joyent.com/blog/manta-postmortem-7-27-2015

That was a pretty humble and good read. I don't think I'd have seen the autovacuuming issue coming. Actually, this quote is a perfect example of how subtle and ridiculous these issues can be:

"During the event, one of the shard databases had all queries on our primary table blocked by a three-way interaction between the data path queries that wanted shared locks, a "transaction wraparound" autovacuum that held a shared lock and ran for several hours, and an errant query that wanted an exclusive table lock."

That's with well-documented, well-debugged components doing the kinds of things they're expected to do. Still downed by a series of just three interactions creating a corner case. Three out of a probably ridiculous number over a large amount of time. Any system redoing and debugging components plus dealing with these interaction issues will fare far worse. Hence, both of our recommendations to avoid that risk.

Note: Amazon's TLA+ reports said they model-checkers for finding bugs that didn't show up until 30+ steps in the protocols. An unlikely set of steps that actually was likely in production per logs. Reading such things, I have no hope that code review or unit tests will save my ass or my stack if I try to clean-slate Google or Amazon infrastructure. Not even gonna try haha.

> Note: Amazon's TLA+ reports said they model-checkers for finding bugs that didn't show up until 30+ steps in the protocols. An unlikely set of steps that actually was likely in production per logs. Reading such things, I have no hope that code review or unit tests will save my ass or my stack if I try to clean-slate Google or Amazon infrastructure. Not even gonna try haha.

For those unfamiliar with the reference, there was an eye-opening report from Amazon engineers who'd used formal methods to find bugs in the design of S3 and other systems several years ago [0]. I highly recommend reading it and then watching as many of Leslie Lamports talks on TLA+ and system specifications as possible.

[0] http://research.microsoft.com/en-us/um/people/lamport/tla/fo...

Anyone who's worked with Postgres at scale would guess autovacuum.

Postgres doesn't have many weaknesses, but most of them relate to autovacuum.

"Anyone who's worked with Postgres at scale would guess autovacuum."

Well, there's knowing it's autovacuum-related then there's the specific way it's causing a failure. First part was obvious. The rest took work.

"Postgres doesn't have many weaknesses, but most of them relate to autovacuum."

Sounds like that statement should be on a bug submission or something. They probably need to replace that with something better.

It's known to the postgres developers, and we are working on it. This specific issue (anti-wraparound vacuums being a lot more expensive) should be fixed in the upcoming 9.6.

Awesome! I already push Postgres and praise its team for the quality focus. Just extra evidence in your favor. :)

What triggered me was just throwing out "reed solomon" when talking about random writes. How does that work? We'll read from 5 places to complete your write?

My impression was that they heard reed solomon was used in robust systems like they are describing. They intend to use it in theirs. It will therefore be just as robust. Similar to how some firms describe their security after adding "256-bit, military-grade AES." ;)

They operate on blocks and can implement Reed-Solomon with no issues. Random writes do not matter with the architecture like this. The tricky part would be latency and performance during periods of growth.

Yeah, but they have to read several data/parity blocks, and then rewrite all parity blocks plus one data block, for any write to a given block.

This creates big difficulties for both consistency and performance, and fixes for consistency make the performance worse (and vice versa).

Google's filesystem could use reed-solomon because they're append-only, making consistency a non-issue and performance can be fixed by buffering on the client side.

Torus is append-only too. We also plan to support something more like what Facebook's paper describes, where they have extra parity (xor) to support more efficient local repair.

How? I thought you're exporting a block device, not a filesystem? You can't append to a block device, and certainly every filesystem out there expects block devices to be random-writable, right?

The "interface" we're exporting is very different from the underlying storage. The block device interface we currently provide supports random writes just fine, but the underlying storage we use (which involves memory-mapped files) is append-only. Once written, blocks are only ever GC'd, not modified.

So if I were to run a database on this, wit a lot of overwrites, the storage would grow infinitely?

Secondly, this implies you are remapping the LBA (offsets) all the time, perhaps taking what would be sequential access and turning it into random? That sounds pretty painful.

Nope, previous block versions get GC'd. I don't see how LBAs have any relevance here... you're talking about a much lower layer than what Torus is operating on.

You're providing a block device interface to the container. The container's FS is addressing LBAs. Sequential reads to the container's adjacent LBAs get turned into reads to whatever random Torus node is storing the data, based on when it was last written...

Exactly what you said. Torus is exposing block on top of what could be described as a log structured FS. So while you may not know about LBAs, there are LBAs involved. I took a look at the code and you are putting a FS like ext4 on top of your block device. Any time an LBA is written to, you append to your store. This causes sequential access to become random, and in addition causes unneeded garbage collection issues.

Further more, it appears to me that etcd is now in the "data path" That is, in theory, each access could end up hitting etcd.

If so, I really would question why anyone would do this at all... this is not how any storage system is written.

The problem here is that you are trying to do block on a file system. This is a bigger problem than you can imagine and while you may think lbas are not involved, there actually are. You are naively taking on a well known area in storage

Ok, so that plus a little MVCC can make you consistent, but you've still got the read-many-to-write-one thing from the perspective of your block device interface, right? And block devices, if I'm remembering right, don't leave you any room to buffer pending writes.

Torus implements a kind of MVCC, yes. As for read-many-to-write-one, I assume you're talking about Reed-Solomon or similar erasure coding? There have been some papers written about ways to reduce that, a good one is from Facebook: https://code.facebook.com/posts/536638663113101/saving-capac.... And that's just one option. Also, this is all speculative since we have yet to implement erasure coding.

Don't know if you'll see this, but:

If only one host at a time has access to a given virtual block device, there are some opportunities to buffer outgoing writes with a write-through cache. That might be the way to go if you explore erasure coding in the future.

Well, good luck with it. Don't get me wrong, I'd love 1.5x redundancy overhead instead of 3x. But even if you have to downgrade to offering either replication or XOR, it's still a huge missing piece of the typical container deployment, so good luck.

Agree 100% on this, Bryan. Their claims about how existing storage solutions are a poor fit for this use case are completely false. As too often happens, they address only high-margin commercial systems and ignore the fact that open-source solutions are out there as well. (Disclaimer: I'm a Gluster developer). Then they leave both files and objects as "exercises for the reader" which shows a total lack of understanding for the problem or respect for people already working on it. Their announcement is clearly more about staking a claim to the "storage for containers" space, before the serious players get there, than it is about realistic expectations for how the project might grow. Particularly galling are their comments about the difficulty of building a community for something like this, when such communities already exist and their announcement actually harms those communities.

Until now, I've thought quite highly of the CoreOS team. Now, not so much. They're playing a "freeze the competition" game instead of trying to compete on merit.

How is them developing their own technology not competing on merit? Did they steal the technology? Did they claim that anyone ought to use this in production? Why all the negativity? Yeah never mind let's just discourage everyone from trying to build new technology. Nobody is forcing you to use this. If it's not for you, move on. Going out on a rant about what you think their intentions are is ridiculous.

Really you come off as defensive. I don't understand how CoreOS has disrespected anyone by coming up with their own approach to the problem.

They're not writing blog posts or comments on HN, they're writing code.

> They're not writing blog posts or comments on HN, they're writing code.

Actually, the problem is that 90% of the code still remains to be written, while others (including me) have already done so. They've addressed only the very simplest part of the problem, not even far enough to show any performance comparisons, in a manner strongly reminiscent of Sheepdog (belying your "own approach" claim). That's a poor basis from which to promise so much. It's like writing an interpreter for a simple programming language and claiming it'll be a full optimizing C++ compiler soon. Just a few little pieces remaining, right?

It's perfectly fine for them to start their own project and have high hopes for it. The more the merrier. However, I have little patience for people who blur the lines between what's there and what might hypothetically exist some time in the future. That's far too often used to stifle real innovation that's occurring elsewhere. Maybe it's more common in storage than whatever your specialty is, but it's a well known part of the playbook. It's important to be crystal clear about what's real vs. what's seriously thought out vs. what's total blue-sky. Users and fellow developers deserve nothing less.

Actually running just:

    torusctl init
    torusctl -C $ETCD_IP:2379 init
   ./torusd --etcd --peer-address --data-dir /tmp/torus1 --size 20GiB
to have a near production ready system is way easier than setting up glusterfs (even as a demo) and ceph. A distributed system doesn't need to be complicated.

Actually that's almost exactly the same steps as for GlusterFS.

  > gluster peer probe ...
  > gluster volume create ...
  > gluster volume start ...
But that's not even the point. You're right that the interface to a distributed storage system doesn't need to be complicated, but the implementation inevitably must be to handle the myriad error conditions that will be thrown at it. Correctness is even more important for storage than for other areas in computing, and something that only implements the "happy path" for the simplest data model or semantics is barely even a beginning. The distance between "seems to work" and "can be counted on to work" is far greater for this type of system than for most others. I think it's important to understand and communicate that, so that people don't develop unrealistic expectations. That way lies nothing but heartbreak, not least for the developers themselves. It's far better for everyone to set and meet modest goals than to make extravagant promises that can't be kept.

Also, they are writing comments on HN. As are you. What exactly is your point?

If I could also add that NBD is extremely notorious in Linux. I am a block storage developer and NBD has been one of the main reasons why so many openstack storage products (like formation data and others) have really struggled. There are many known Linux kernel issues with NBD, for example, if an NBD provider (user space daemon) exits for any reason, the kernel panics. Here is an example of a long outstanding bug that has plagued the NBD community and openstack community for years (https://www.mail-archive.com/nbd-general@lists.sourceforge.n...). It won't get addressed any time soon. I also looked at the repo and it looks like a very simplistic approach to a hard problem. BTW I have used etcd and if this is anything like etcd, I'd really be worried. etcd snapshots bring an entire cluster down for a while.

Here is an example script that will panic anyone using NBD to server a block device (replace qemu-nbd with the nbd export provided by torus)

qemu-img create -f qcow2 f.img 1G mkfs.ext4 f.img modprobe nbd || true qemu-nbd -c /dev/nbd0 f.img mount /dev/nbd0 k killall -KILL qemu-nbd sleep 1 ls k

Above script with correct formatting:

  qemu-img create -f qcow2 f.img 1G
  mkfs.ext4 f.img
  modprobe nbd || true
  qemu-nbd -c /dev/nbd0 f.img
  mount /dev/nbd0 k
  killall -KILL qemu-nbd
  sleep 1
  ls k

But: that's not a kernel panic, just a BUG() stack trace from kernel which doesn't halt the system.

Looking at dtrace, fishwork, zfs and Solaris (imo was the best OS technologically speaking) it's interesting to see how strong and innovative its engineering team was while the business was just going down. How did that happen? How can the engineering team be that productive and functional while the business vision was so lacking?

Even when commercially misguided, Sun always had terrific engineering talent -- and my farewell to the company captures some of that.[1] In terms of why did the company fail, the short answer is probably that SPARC was disrupted by x86, and by the time the company figured that out, it was too late to recover.[2]

[1] http://dtrace.org/blogs/bmc/2010/07/25/good-bye-sun/

[2] Longer answer: https://news.ycombinator.com/item?id=2287033

I've always been very impressed by Sun's talent, the team was . And thank you for dtrace! :) If only Linux community / leaders would be less arrogant and adopt technologies from Solaris/BSD that are order of magnitude better (kqueue, netgraph and more) instead of coming up with new ways to screw up.

Sun should be resurrected now given that risc is leading the way and build everything on top of ARM! :). if only.

Linux can't be blamed for the decision to place Sun's technologies under the CDDL.

there's no license for kqueue's interface or netgraph, and many other great interfaces. but they decide to create square wheels and not leverage from errors that others have made before them.

You're right.

I was thinking mostly of the eternal buzzkill that ZFS is only usable through indirect means.

Just how many things is CoreOS trying to do? Last I counted, they want to

a) Build a distributed OS

b) Build a distributed scheduler (Fleet)

c) Build a distributed key value system (etcd)

d) Build a new container engine (Rocket)

e) Build a network fabric (Flannel)

f) Now embark on building a brand new distributed storage system.

Holy cow that's some goal list. Sounds like something my kids would make up for their Christmas wish list.

I really don't get how their board and investors let them get away with such a childish imagination.

Each one of those is a company effort on its own.

Fleet died once Kubernetes became a thing, even though they serve entirely different use cases.

Actually the biggest problem is actual either Full Posix or Mutable Storage. If there is just one way in and never a way out, it's actually way easier. So practically if you don't run your database on shared storage etc and manage it "the old way". you could have something reliable by a 'never delete'.

However CAS and immutable file systems aren't that common.

Their one saving grace might be that NBD is such a simple protocol. By not trying to make it a full-fledged FS, they might actually have hit on a tractable problem.

> unaware of how difficult it is to even just reliably get bits to and from stable storage

It's not difficult, it's actually impossible. And any system relying on the reliability of that would be broken by design.

> let alone string that into a distributed system that must make CAP tradeoffs

No. Distributed systems and "CAP tradeoffs" actually exist to solve the problem above with predictable certainty.

It seems you're disagreeing with yourself.

Care to point out where? I can explain if something seems too ambiguous.

It is impossible to reliably store something in your local storage. It is possible to achieve some probability of data retention and availability in a distributed system.

Just as a separate question: why are you so bitter about btrfs? ;)

Good... Good... Let the hate flow through you.

An open source OS company just blogged about a new open source project that they are putting resources into. They did not release a commercial product that competes with any existing storage solution.

How exactly would one expect a new project to be announced? Tell me more about how far away from done you think they are.

Sorry if that's a bit sarcastic, but seriously wouldn't it be nice if these comments where more along the lines of: "Interesting new project, here's some issues we've run into vis a vis reliable storage that you should look out for..."

Yes, I have to agree. There's a little too much negativity for my liking, especially from people who have a vested interest in downplaying the efforts of others.

I absolutely agree, all those negative comments sound like rants from grumpy old guys. I have worked in HPC distributed storage and the problems the CoreOS people are trying to solve is a valid and present one: deploying flexible distributed storage today is an absolute pain.

Plus all the arguments about local storage failure being hard are moot because this is a distributed solution and the hardship of local storage reliability can be abstracted away almost completely.

I for one find it very positive, a storage engine implemented in Go, for what I could gather in a quick Github glance.

For me, this is surely systems work.

IMO it is because this can derail the adoption of containers. I think an immature project like this announced so prematurely does more harm than good. And I appreciate people that have experience with this bringing up the real issues.

I love CoreOS and they've done some super impressive engineering. But really, a new storage system? Rewriting in Go and using etcd for central state management makes things easier, but this is still a hard problem.

Some things that need to be solved sooner or later: data replication so that N faults of X entities are protected against (X can be disks, enclosures, racks, data centers, regions, ..), recovery from failed disks, scrubbing, data management, backups, some kind of storage orchestration and centralized management.

If you look at Ceph which IMHO represents the state of the art in software defined storage, it took many years to get it to a point that it was usable. I hate to be cynical but in this case I would be surprised if CoreOS can pull this of. I wish them all the best though and would be happy for them if I am wrong.

> this is still a hard problem

Which is why I'm rather surprised that these systems continue to be released without any formal specification or model checking. It's hard to get these things right... worth the effort, IMO, to create formal specifications.

> But really, a new storage system?

I think the post outlines quite well why they chose to build a new distributed storage layer. From a product perspective it makes a lot of sense for them to have an offering in this space. Also, the fact that it's a hard problem is what makes it a compelling product offering. You can't make much money solving easy problems.

> Also, the fact that it's a hard problem is what makes it a compelling product offering. You can't make much money solving easy problems.

If they can solve the hard problem, yes. Which I believe is unlikely.

It's like any time we take a step forward in one area we have to reinvent the last 50 years of computing to support it.

Persistent storage is a "hard problem" Really?

Persistent storage isn't a hard problem. Distributed, well performing, scalable and consistent storage is a hard problem.

Actually, persistent storage is fairly hard in itself. Look at what ZFS does to ensure data integrity in the face of phantom writes, dropped writes, bad controllers, and other implicit, non-fatal failures.

ZFS still also have the issue of having to perform well. You have a point, but ZFS is still trivial compared to a proper distributed filesystem, and you could achieve the same reliability much easier than ZFS if you sacrificed the performance.

The ClusterHQ guys behind the FlockerHQ found this out the hard way [0]. Initially Flocker was meant to provide a container data migration tool on top of ZFS, now it is a front-end to more established storage systems like Cinder,vSan,EBS,GPD and so on.

[0] https://docs.clusterhq.com/en/latest/faq/#what-happened-to-z...

Absolutely -- I didn't mean to imply that ZFS even comes close to solving the distributed parts of the problem, but rather that a distributed storage system does have to address the problems of putting bits on disk.

> Actually, persistent storage is fairly hard in itself

Don't we have distributed data storages precisely because it's impossible to guaranty persistence locally? It's kind of a way to not bother trying to solve the impossible, but to achieve some guarantees on a different level.

That's one use case, but far more common is reducing latency and increasing bandwidth for distributed computing acting on a shared set of data.

Persistent storage on today's filesystems and complex hardware is a hard problem. All kinds of failures can happen during any write. Some are obvious with some silently corrupting data. There's been decades of work on approaches to dealing with this with a variety of tradeoffs. Picking the right one for a widely-deployed, portable, distributed app is tricky by itself.

They're aiming to do a lot more than that. ;)

I've been trying to find performance evaluations of distributed file systems, and in most tests I've seen Ceph is a lot slower than alternatives like GlusterFS.

When you say Ceph is "state of the art in software defined storage", are you including performance, or are there advantages in features or reliability where you think it outclasses the competition?

The CoreOS team is excited to make this initial release and start collaborating with folks that want to tackle distributed storage.

tl;dr this is a new OSS distributed storage project that is written in Go and backed by etcd for consistency. The first project built on top of Torus is a network block device that can be mounted into containers for persistent storage. It also includes integrations out of the box for Kubernetes "flex volumes". Get involved! :)


Brandon, a few months ago I met you at a meetup and you did not think this problem needed to be solved, and containers were for ephemeral apps... just an AMA request, what changed hearts?

What's the current largest-scale deployment?

It's described as a prototype, so I'm pretty sure there's no deployment at scale at all.

Storing container images is a great use case for an object store, not a block store. They should get sucked down whole to ephemeral storage on the compute nodes.

Mounting containers to a distributed block system is an anti-pattern. This is going to go poorly.

Ceph has had some really smart people working on distributed block for a lot of years, and they still have significant issues. It's not because they're dumb, it's because performant, scalable, and available distributed block is either hard or impossible.

This is about providing data volumes to containers, not about hosting container images or mounting containers themselves.

> At its core, Torus is a library with an interface that appears as a traditional file, allowing for storage manipulation through well-understood basic file operations. Coordinated and checkpointed through etcd’s consensus process, this distributed file can be exposed to user applications in multiple ways. Today, Torus supports exposing this file as block-oriented storage via a Network Block Device (NBD). We also expect that in the future other storage systems, such as object storage, will be built on top of Torus as collections of these distributed files, coordinated by etcd.

Am I understanding correctly that this is a file-based API? Distributing a POSIX filesystem effectively is very challenging, particularly since most applications that use them aren't written with CAP in mind; they don't expect a lot of basic operations to block for extended periods and fail in surprising ways, and they very often perform poorly when operations that are locally quick end up much slower over a network.

To be concrete:

> Today’s Torus release includes manifests using this feature to demonstrate running the PostgreSQL database server atop Kubernetes flex volumes, backed by Torus storage.

It will be interesting to see how well this performs and how it behaves in the face of single-node failures and network congestion.

It isn't a POSIX filesystem API. It is like a single file or a big distributed tape.

Virtualized, network block devices have all the same problems I described -- even worse, because the abstraction coneys even less about what an application is trying to do.

No. The difference is, that with network block devices you are usually only allow accessing the block device once. That's an easier problem than mapping POSIX file system semantics!

I don't understand why this is being used over say librados + librbd from Ceph. This seems like an awful lot of work to get the same functionality that Gluster/Ceph already have. Is there something I'm missing here?

A few major things we wanted to accomplish:

Backed by etcd. Today etcd is a well tested and widely used consistent store with users like Kubernetes, Flannel (now Canal), Fleet, SkyDNS and many others. Building something like etcd requires tons of testing and etcd is becoming the solid go to for this category of distributed problems.

Easy to work on code base. Building an OSS community around a complex technology is really hard. It shouldn't be underestimated that a code base that can be built on OS X, Linux, and Windows will have an advantage in getting people involved and maturing the project.

Designed for other storage application. Torus is backed by a distributed storage interface that is exposed via gRPC (with the opportunity for language bindings in lots of languages). Our hope is to let people build write-ahead log systems, object systems, or filesystems on top of this abstraction.

Flexibility built into the storage layer. Over time we want people to implement different hash maps or explicit configuration of storage layout. This is an abstraction that we have today and intend to build out over time with feedback from users.

Happy to answer more questions. This is a young project and we look forward to working with people and hearing about what they want out of a storage system. Overall, we want to build something with folks that is easy to work on and backs itself with technologies that are getting good operational understanding in the "Cloud Native" space.

I've managed to lose data with Etcd, and have regularly had issues with membership issues requiring maintenance. Meanwhile I've had Glusterfs volumes remain available for 5-6 years without maintenance at all.

To me at least, having it "backed by etcd" is a big red flag, not a feature.

I agree. I was also thinking of Ceph when I wrote my reply to Cantrill. Esp Ceph on XFS given maturity of both plus enormous effort that's gone into XFS. Sector/Sphere with UDT is also kick-ass. It took both of these a long time to get to where they were fighting all kinds of unforseen issues. They also tried to build on proven components that themselves were battletested for years.

So, this company is trading away stuff like that for custom components and Etcd? And for a component focusing on integrity and availability? Huh?

  I've managed to lose data with Etcd
How did you manage to lose data?

Use etcd in production and you will lose data. Bounce nodes, simulate power loss and you will lose data. Have you actually used it?

I don't remember the details, and to be clear this was not with the most recent version of Etcd at all - it was quite a while ago. I think and hope that whatever the problem was is no longer an issue. It has certainly gotten a lot better. My point is mainly that it is way too young to be something to trust important data to. We didn't either - what we lost was a cluster configuration that we could recreate from a combination of backups and redoing a handful of operations, but at the time it was scary to see, and prompted us to be very careful about what we put into Etcd going forward.

> Backed by etcd. Today etcd is a well tested and widely used consistent store with users like Kubernetes, Flannel (now Canal), Fleet, SkyDNS and many others. Building something like etcd requires tons of testing and etcd is becoming the solid go to for this category of distributed problems.

Being built atop something else doesn't mean it inherits all the stability of the underlying system; it's an upper bound not a lower bound.

It sounds like etcd is in the I/O path. Is that correct?

It's in the sync() path, but not in the I/O path.

Why is it in the sync path? That seems unnecessary, as well as bad for both performance and correctness (another thing in any path means another potential for partial failure).

This is not a great analogy since their first application is block storage, but think about a journaling file system. Typically file data is not journaled, but metadata is. By having a consistent view of the metadata, the entire filesystem (as far as you can interact with it) is consistent. That consistent journal is the same primitive that etcd is providing.

I don't think the two cases are analogous. First, block devices don't have any metadata to speak of. Second, filesystems have journals to "cover up" for the main-store operations being asynchronous and/or non-atomic. However, sync() is completely synchronous by definition - hence the name - so such "covering up" would be superfluous. There must be some metadata that's being written at etcd because it needs to be read from there later, but there's nothing in block-device semantics to require any such thing.

Thinking about it further, I think I can guess at what's going on here. The key observation is that there's no sync() at the block device level. It's a filesystem operation; block devices don't see it. Sure, there are queue flags and FUA and such, but those are different (and I'm not sure any of those exist in NBD). Where is this sync() path? I'm guessing it's internal on the data servers, to deal with data that's being buffered there. With both replication and erasure coding, correct recovery requires exact knowledge of what has been fully written where, and that's the kind of metadata I suspect is being put in etcd. There's not even necessarily anything wrong with it, unless updating that information only on sync() means that supposedly durable writes since the last (unpredictable to the client) sync could be lost on failure. I hope that's not the case.

Maybe I'll find time, in the midst of my work on an actual production-level distributed filesystem, to look at the code and see if my guess is correct.

Block devices (and NBD specifically) absolutely have a notion of sync(). We use sync() as the unit of write visibility. All writes up until a sync are effectively anonymous until a sync().

Please go look at the man page for sync(2), which is also the page for syncfs(2). It is explicitly a filesystem-level operation. Obviously, this will cause data to be flushed from the filesystem down to lower layers. Obviously, you can "sync" virtual (e.g. NBD or loopback) block devices by syncing the filesystems that contain their backing stores, but that's not the same thing. No filesystem, no sync(2). For block devices with no file-based backing store, sync(2) is inapplicable. Also, sync(2) is latency-inducing overkill if you're trying to ensure durability for anything less than all filesystems attached to a machine. More often, fsync(2) on the backing files is what you should be using.

> All writes up until a sync are effectively anonymous until a sync().

"Anonymous" means nothing in this context. Do you mean non-durable? First, please use the correct term. Second, if that is what you mean then you're probably doing it wrong. File writes are allowed to be asynchronous (unless O_SYNC and friends). Block device writes are expected to be synchronous, or at least to preserve order. This is exactly the kind of thing that needs to be thoroughly thought out before code is even written, and that thinking should be spelled out somewhere for people to help make sure all the nasty corner cases are covered. Your cart is way before your horse.

I kind of see this as a general trend where CoreOS is concerned they reimplement everything in house, even when open source solution might already exist - rkt, etcd, now this.

Rkt had good reasoning behind it, and I think it's reasonable to assume that a lot of the recent changes to Docker came because they forced Docker's hand.

For Etcd I'm of two minds. I like its simplicity. On the other hand, I have had far more problems with Etcd than I've ever had with e.g. Consul (though I dislike Consuls "and the kitchen sink" approach), up to and including data loss and more than once having to re-initialise clusters. I'm a couple of years of no further problems at least a way from trusting Etcd with my data.

For this? They have to have very compelling stories for why they couldn't e.g. modify Ceph or Gluster to achieve whatever it is that they want. And it's no clear what they want. Gluster for example is built as a stack of pluggable translators that is relatively easy to extend.

Even if they have a credible and compelling story for why we need something else, it will take years of production use by others before I'd trust any data to a new distributed file system.

Gluster and Ceph have those years. I've had a Gluster volume running without maintenance or data loss for more than five years. Even then it took several years before I truly started to trust it.

Have you ever thought the etcd issues was just because how you operate it? Or maybe you were running an old version of it? Have you reported to upstream? If there is a very common issue that you can meet frequently, it should have been fixed or you should have reported it.

I'm quite sure the etcd issues was "just because" of how we operated it. But the point is that I have had these issues with Etcd, using the default CoreOS configuration of Etcd, and I've not had these issues with Gluster despite having Gluster volumes running many times as long.

Gluster had many issues too when it was a young project, but they were fixed many years ago. I don't doubt that Etcd too will become solid enough for me to trust, but looking at part of the docs [1] we still have "gems" such as "Permanent Loss of Quorum Requires New Cluster" (you don't lose your data to that one, but it's painful nevertheless).

And yes, I've run into that one. Not "permanent" as in "we could never recover the machines in question", but "permanent" as in "we're not going to be able to get this sorted fast enough, so let's fail everything over". Everything else we run can handle that scenario easily. We know it means consistency issues, but if you run into that situation we have taken the decision that they are acceptable and will be resolved later if necessary. The last thing I need in a scenario like that is for some critical component to just refuse to run without requiring lots of manual intervention.

All we'd like would be to be able to say "ok, drop all the other members and pretend all is well, and yes we know what we're asking". There are a number of gotcha's like that with Etcd that sounds find until they bite you in the ass.

They'll resolve these things eventually, but storage is the very last place I'd be willing to take risks with it. So until they have years of track record of running flawlessly, with less hassle than Gluster or Ceph, and substantial advantages, I'll see it as the dangerous choice.

[1] https://coreos.com/etcd/docs/latest/runtime-reconf-design.ht...

Just to be clear: your complaint is etcd won't let you force the cluster into an inconsistent state to mask your poor infrastructure decisions and therefore etcd is a "dangerous choice"?

The "poor infrastructure decision" would be to assume that because there's a network split it is automatically going to be unsafe to, for example, keep scheduling containers in each data centre, even the minority, without knowing whether or not constraints have been put in place to ensure this is safe to do.

How many separate data centres do you want me to host in? Two is certainly insufficient, as if the data centre with the majority of Etcd nodes now falls of the network, you're SOL.

Three? Think you can't have two fall of face of the earth the first time? Been there. Redundant connections doesn't always help if there are major external connectivity issues and you don't have multi-million dollar budgets for connectivity alone.

Most small to mid size companies do not have the budget to run a system that is sufficiently redundant to be able to guarantee that they won't sooner or later be in a situation where either 1) the most viable partition is the minority, or 2) the Etcd cluster is centralised enough that too large parts of the cluster can loose access to it. When fighting large scale failures is the last time where you want to have to fight your tools to be able to convince them that you know what you're doing.

For those kind of situations, Etcd's lack of support for more fine-grained control of consistency or for intentionally breaking apart the cluster and letting them continue to operate independently when you know what you're doing basically makes it unsuitable for running multi-data centre clusters for anything important.

Which basically means any tools that assumes a single underlying Etcd cluster is appropriate as storage, for, say, a cluster scheduler (cough) becomes an inappropriate tool for those kind of setups.

Yeah, I've been smelling NIH for quite a while.

And it generally seems that, when they say "composable modular tools", one should read "includes dependencies on the rest of our stack". This isn't intended as harshness. It is just that every time a tool from COS seems interesting enough to look in to, it turns out that I wouldn't be investing in a tool, it would be investing in an ecosystem that duplicates a ton of what I'm already managing.

Kubernetes is a big exception to this, it seems. Before Kubernetes was announced, CoreOS was actively developing fleet. After Kubernetes, fleet development was largely halted and is now just a "distributed init system for low-level cluster components," e.g. a way to bootstrap Kubernetes itself. I asked the lead developer of fleet if they would have created it if Kubernetes had existed at the time, and he said no. Kubernetes has turned into a big part of the CoreOS pitch, and they've contributed a lot to the project. Giving up a project you started in house because you recognized that another organization's project fit the bill better is very far from NIH.

what is the existing alternative for storage in distributed container environments?

If Lustre is the answer, you're probably asking the wrong question. Unless that question involves short life-span, massively parallel swap-outs from one weapons simulation to another.

I write this not as storage religion (of which there's far too much), but to warn away those who haven't experienced the many kinds of data (and stomach lining) loss that come with being a Lustre admin.

ObjectiveFS[1] is another option if you have access to an object store (e.g. S3 or GCS) from your containers.

[1] https://objectivefs.com

Add this to your list:


I swore I added it to the Wikipedia links. I'll have to do it again, I guess.

Have a look at Open vStorage (http://www.openvstorage.com). Some highlights: open-source, core is battle proven for more than 7 years, performance, scales across datacenters , unlimited snapshots, ...

(disclaimer: I'm an engineer there...)

Competition is good, even in open-source land. By the way, I'm still hoping for an open-source competitor for elasticsearch.

> I'm still hoping for an open-source competitor for elasticsearch

I thought Elasticsearch was open-sourced, and the GitHub repo says Apache v2. What am I missing here?

Nothing. It is open-source. But it needs an open-source competitor. There's too little happening at the open-source side of search.

Huh? It has a major competitor: Solr. I'm not sure what you mean unless you object because they both use lucene.


What about Solr? They're both based on Lucene but I imagine it qualifies.

See also: Piston Cloud.

Okay, so from reading the developer docs at https://github.com/coreos/torus/Documentation, I'm inferring that:

* Single reader/writer. If you try and set up more, Bad Things happen (even if it's single-writer, you don't get read-after-write consistency).

* It sure sounds like network partitions will allow all kinds of badness.

* If copy 1 goes down, you can keep operating on copy 2, then lose copy 2, have copy 1 come back up, and then warp back in time? Maybe that's prevented because of append-only and you just lose the data entirely because it was only replicating to one node due to the "temporary" failure.

Most egregiously, the Architecture description of an Inode implies that persisting a write-to-disk requires persisting the "INode". INodes are persisted in etcd. Which means your entire cluster's write-to-disk throughput is limited to what you can push through a Raft consensus algorithm.

Look, there are all kinds of reasons one could legitimately decide that none of the existing scalable block storage systems satisfy your use case. Maybe containers really are different enough from VMs. But the blog post claims that there just aren't any solutions; the research papers cited in the Documentation page are mostly old and are about systems in a very different part of this sub-space; and what developer documentation exists does not encourage me that this is a good idea.

Granted, I've been working on Ceph for 7 years and am a bit of a snob as a result.

They accept data loss, yes, and haven't figured out consistency yet. But given that it's an early prototype I would expect them to evolve and change things a lot in a couple of years, possibly throwing away raft and etcd.

Good work on Ceph, by the way. I've been following your work since it was a PhD if I remember correctly.

"haven't figured out consistency yet" is the issue. Seemingly very small decisions have huge impacts that you don't expect, and the state of the art is far enough along that you don't get better than the existing solutions except by exploiting newly-identified workload characteristics (or acceptable ways of losing data, loosening consistency, etc based on the workload) that you've planned out to make use of ahead of time.

One example: Both Gluter and Ceph have erasure-coded storage. Gluster's looks just like the replicated storage, only it involves more nodes and less overhead. Ceph's is severely limited in comparison: it's append-only, it doesn't allow use of Ceph's omap kv store, "object class" embedded code, etc. The reason is because distributed EC is subject to the same kind of problem as the RAID5 write hole: if a Gluster client submits an overwrite to 3 of a 4+2 replica group and then crashes, the overwritten data is unrecoverable and the newly-written data never made it.

Torus won't hit that particular issue because it is log-structured to begin with, which has all kinds of advantages. But garbage collection is really hard! Much harder than seems remotely reasonable! Getting good coalescing and read performance is really hard! Much harder than seems remotely reasonable! There's one big existing storage log-structured distributed storage system which has discussed this publicly: Microsoft Azure. They have a few papers out which hint at the contortions they went through to make block devices work performantly — and Azure writes first to 3-replica and then destages to the log! They still had performance issues!

https://github.com/coreos/torus/blob/master/Documentation/re... points to a bunch of HDFS research and replacements; HDFS is designed for the opposite (large files with high bandwidth and nobody-cares latency) of what I presume Torus is targeting (high IO efficiency, low latency). Mostly the same for the Google papers they cite. There's no mention of Azure's storage system papers, nor of Ceph, nor anything about the not-paper-publishing-but-blogging stuff from Gluster or sheepdog; nor from academic research into VM storage systems (there's tons about this!).

Can they fix a bunch of this? Sure. But the desires they list in eg https://github.com/coreos/torus/blob/master/Documentation/ar... go towards making things worse, not better. They aren't talking about how etcd can be in the allocator path but not the persistence path[1] and how every mount should run a repair on the data to deal with out-of-date headers. They talk about adding in filesystems, but not any way of supporting read-after-write (which is impossible with the primitives they describe so far, and really hard in a log-structured system without synchronous communication of some kind). They discuss network partitions between the storage nodes, and between the client and etcd; they don't discuss clients keeping access to the disks but losing it to etcd.

[1] Using etcd for allocation would be a reasonable choice, but putting it in the persistence path is now. Right now a database in your container would require two separate write streams to do an fsync:

1) data write. It doesn't say in docs and I didn't look to see if replication is client or server-driven, but assuming sanity the network traffic is client->server1->server2->server1->client, with a write-to-disk happening before the server2->server1 step.

2) etcd write. Client->etcd master->etcd slaves (2+)->etcd master->client, with a disk write to each etcd process' disk before the reply to etcd master.

This is a busy-neighbor, long-tail latency disaster waiting to happen.

These are very good points, but also probably much more constructive as GitHub issues, where they can be either answered or addressed. In the meantime, hopefully I can talk to a few of these:

>haven't figured out consistency yet I don't recall this being the case, but I'm not the authority on the matter.

>I presume Torus is targeting (high IO efficiency, low latency) The best-possible performing storage solution is _definitely_ not our primary goal, though we'll take it where we can get it. The most important goals for the project are ease of use, ease of management, flexibility, and correctness when the use-case desires it. Please note that the block device interface is only one of many planned. The underlying abstraction was designed (and will be improved) to support other situations.

>Using etcd for allocation would be a reasonable choice, but putting it in the persistence path is now. Right now a database in your container would require two separate write streams to do an fsync In Torus today (with the block device interface specifically), and with the caveat that I'm not the authority, so I may be slightly wrong, calling sync(), fsync(), and friends result in what I think you would consider an "allocation". Writes happen against a snapshot of the file (in this block storage case, the block volume is the "file"), and then a sync() makes those changes visible as the "current" version. Syncs hit etcd, writes do not.

I would really encourage you to submit feedback like this in GitHub issues. The project is still in _very_ early stages, and legitimate feedback can actually make a difference.

Where's the formal specification of the system? If we want these distributed systems to be reliable how can we be sure our algorithms and processes work if we do not first model them?

I'd be happy to work on a specification with the community.

We haven't gotten that far down the path. We would love the help. I believe there are some issues related to documenting the architecture and failure domains for the v0.1.1 release: https://github.com/coreos/torus/milestones/v0.1.1

This is one of those classic "release too early or release too late" sort of things that got cut in preference to get early feedback and community participation.

Coming from the company that raised hell about Docker's lack of a spec, this is an interesting approach.

Is the goal to provide POSIX file access as NFS and AFS do? The mention of object storage as a future direction feels like the early "the sky's the limit" euphoria of a new project that has yet to adopt specific end goals.

The project doesn't provide a POSIX filesystem directly. But, instead provides a block device (think AWS EBS).

The initial goals that are in this release:

1. An abstraction for replicated storage between machines in a cluster.

2. A block device "application" on top that an ext4 filesystem could be run on. This is like an EBS.

In the future someone might build other applications like "object storage" or filesystems and we would love to get the feedback on the API to do that. But, that isn't in the initial goals and we will be focusing on the storage and block layers for the time being.

Will each disk in the system be weighted so that different size disks can be used in the system over time? OpenStack Swift offers this and is useful to not lock-in disk size at design time.

More than "sky's the limit" -- early versions had POSIX access, though it was terribly messy. We know the architecture can support it, it's just a matter of learning from the mistakes and building something worth supporting.

Seems really nice. Would love to see it how it compares to Ceph, we were considering Ceph for our container storage. For starters, it seems much more simpler to deploy.

In my old firm we managed to get ceph running on coreos using ceph in docker. The storage was exposed using a s3 gateway.


Docker supports native RBD since version 1.8. Which should be both faster and more reliable than using the S3 gateway (though that has other benefits and becoming pretty good too).

I work at a company where object storage is something they have worked on for many years to get it stable. You can download it for free (it's Open Source)! Open vStorage is very easy to deploy through Ansible! Try it :)! https://www.openvstorage.com/ https://github.com/openvstorage

I'm always in awe and grateful for the teams like CoreOS that tackle these tough problems.

But when it comes to container storage I'm confused.

The 12 factor app methodologies work very well for every service I have written and supported.

Processes/containers get ephemeral storage. When you need persistence you delegate to a stateful service, generally Postgres and S3.

Id much rather write my apps around these simple and well understood constraints then depend on magic file systems.

The average 12 factor app doesn't really need this, but as you mentioned, sometimes you delegate to Postgres, or S3, and today those are available as hosted products via Amazon, Google, Heroku, etc.

However, when your running in your own datacenter, you don't have the luxury of using Amazon's hosted products, so Torus exists to help provide some building blocks to build your own S3, or potentially even RDS.

Congrats on the relaase! How does this compare to GlusterFS (loved it, was free - I used it before they got acquired by RedHat) and Isilon (expensive but fast and reliable). Is there an upper limit on number of nodes(?) or total size?

So is this the building blocks for an in-house EBS?

Yes, that is the idea. Torus block is like EBS. Torus "library" could be used by other applications like an append only log, object store, etc.

How do you say that with conviction? I looked at the code and it has no bearings of being able to build a fault tolerant elastic block storage system.

No mention of Jepsen tests, I look forward to a @aphyr call me maybe talk/blog post detailing all the issues.

This reminds me of the SheepDog storage architecture (https://sheepdog.github.io/sheepdog/) SheepDog provides safe distributed block devices and uses corosync or Zookeeper instead of etcd.

Despite all the negative sentiment here, I am super excited about this. I use CoreOS heavily and really like how everything just works. Running Kubernetes on it, is the first cluster solution for me that works without configuration orgies and is robust against machine outages. Torus seems to be the missing piece. For now we use local volumes with sidecar containers for r/o storage and nfs volumes for r/w storage.

All other solutions are not practical. GCE and EBS are only single mount. iSCSI is unsupported in the cloud. Leaving only Ceph and Glusterfs, both mentioned here, but needing heavy configuration.

Gluster does not need heavy configuration: just add peers and create and start volume - 3 commands. Not harder than lvm. Of course, gluster has tons of options to fine tune cluster for various kinds of loads, which is not bad, but confusing for newbies. But if you are newbie, why you need to change defaults?

That's interesting. When I was looking at it, it seemed overwhelming. Do you have some pointers for the basics?

There's lots of work being done in Ceph-Docker to make ceph easier to configure.

On top of that, there's work going on to get Ceph easily deployable on Kubernetes.


Many thanks, that looks doable. I'll give it a try.

Torus is also definitely single mount.

Watch out, next we'll be seeing CoreOS writing their own encryption protocols. As someone that's designed and built storage clusters from the ground up - take it from me: storage is not as easy as it seems if you care about your data and performance at scale. The number of people I've seen using CoreOS and then moving away from it is quite alarming, I feel like CoreOS will become the Ubuntu of the container world.

They're all-in on Kubernetes, which is quite apparent given (a) who's funding them and (b) product direction, including this one. They don't have the resources or backing to go after Docker; notice that changed? Docker has something like 15x the funding and is pulling away in a lot of ways, so CoreOS hitched on the Kubernetes wagon.

Better bet for destiny: Google is letting them build stuff off the books by just funding them, and at some point they'll get quietly bought with all their stuff added to Kubernetes, and that'll be that.

+1 for the assertion about the destiny of CoreOS.

So a userspace fs, I'm guessing this will use fuse to actually expose a POSIX fs? Small sync writes will absolutely kill performance and will require all sorts of hacks like glusterfs has had to implement due to the amount of context switches. CoreOS really should have added whatever needed to ceph, filesystems are not something you just hack together overnight.

We had a POSIX interface early on (via FUSE), but decided to expose a a block storage interface first instead. This is not a "filesystem", it's a storage abstraction, and we've spent more than 6 months on it. Seems like a lot of folks are quickly jumping to conclusions, which is expected from "the internet", but I would have hoped for better from HN.

6 months is overnight in terms of a filesystem, there are literally millions of dollars of work that was done with glusterfs and probably a magnitude less man hours. I understand coreos wants.to innovate but what does torus bring to the table besides negatives and NIH?

> but I would have hoped for better from HN

Don't worry about it. Some guys just mistake their experience with traditional storage systems as meaningful here.

Funny, just this week I was looking for an easy solution to cluster storage for containers.

I found SXCluster and posted it here https://news.ycombinator.com/item?id=11812566 - not directly geared towards containers, but very easy to setup (no need for etcd and the like).

I just hope the performance is visibly better than of gluster. Especially for small files.

are there any performance metrics for Tarus yet?

You could also check out EMC's Unity midrange storage at the EMC Store. http://bit.ly/1SIZ9N7

How is this different than Tahoe-LAFS?

Is it possible that Facebook haystack will be open source?

I love that I know who this is at CoreOS by the choice of username. Pick better throwaways if you're going to blow up a thread like this. (ideal0227 upthread is also a CoreOS employee and one of the etcd developers.)

On topic, yes. It is quite trivial to lose data with etcd and pretty much everybody I know who runs it has experienced problems. Try backing it up and restoring. etcd is compsci great yet operationally burdensome; it is extremely difficult to operate, tends to make assumptions and explain to you how to operate your fleet, and the developers are not very receptive to these and other operational concerns. This stems from CoreOS culture, to be quite clear -- CoreOS seeks to eliminate operations as a discipline and replace it with software, and tends to disregard or devalue operational concerns as a result.

Launching down the storage path with Go software and disregarding several decades of operational experience in the industry on how to do distributed storage correctly (not to mention the "looking down" on etcd-related infrastructure decisions from this anonymous employee) should indicate to you that I'm not making this up, despite how it may appear.

90% of folks I know on etcd, I'd estimate, have (a) reverted to Zookeeper or (b) moved on to Consul. It is the single piece of software that is holding Kubernetes back, too, and it's fairly obvious from roadmap direction that Kubernetes is now the etcd customer. Plan accordingly.

> I love that I know who this is at CoreOS by the choice of username

You can't comment like this here. It's a breach of civility and took the conversation in a needlessly personal direction. Please don't do this again.

We detached this subthread from https://news.ycombinator.com/item?id=11818589 and marked it off-topic.

Cockroachdb is built on top of etcd. So far seems to work very well for their product.

I think this whole thread is a bit harsh and unfair to CoreOs and their solutions.

I think that reflects the complexity of a storage product more than CoreOS itself. I use CoreOS. I like it. I love the update mechanism, and the tight focus on running as much as possible in containers.

But I also have spent enough time using it to run across certain warts, and Etcd clusters refusing to start without manual intervention etc. have been a frequent enough problem that when they go after something incredibly hard such as distributed storage, and it's relying on Etcd, I see that as a scary combination.

Disclaimer: I work on etcd at CoreOS.

CockroachDB is built on top of etcd/raft package, not etcd itself.

While we're talking about who is who -- the parent poster is Jed Smith; he's a sysadmin and disgruntled former CoreOS employee. I'd take what he says with more than a grain of salt.

Personal attacks are not allowed on HN, nor single-purpose accounts. Novelty accounts for personal attacks are right out, so we've banned this one.

It's threads like these where you start to wonder just how many marketing teams are arguing with each other in the comments. I take the FUD with a grain of salt: there is a lot of financial incentive to create FUD, where there is financial incentive and little to no regulations a market will naturally arise. It's going to get worse as the bubble pops and companies become more and more desperate.

I don't think it's marketing teams. There are at least a half dozen CoreOS employees/contributors commenting here, plus a couple of known allies. On the "other side" there's Bryan, me, and one person who seems to be an ex-employee. AFAIK none of them are in marketing, or even in coordination with marketing. Certainly, I often get flak from my company's official mouthpieces for saying things that conflict with their preferred talking points. They wish they could control what I say.

Mostly I think this is a matter of people naturally standing up for their friends and colleagues, which is a wonderful thing, vs. people who have specific concerns about The Right Way to do either technical or non-technical things. No coordination or collusion is necessary. You can see exactly the same thing happen for every company or project that's discussed here, or on Twitter, or wherever. Try criticizing a YC portfolio company some time. Not all of the people arguing with you will admit their affiliations, and of course the downvotes are all anonymous anyway. Just remember, the more skin someone has in the game, the more tempted they'll be to cross that vague line into astroturf. All the rest of us can do is admit our affiliations and biases, and hope that people will get their heads out of the ad hominem gutter enough to reach rational conclusions about the facts being presented.

Indeed. The upvoting of bcantrill seems to be devoid of understanding that he has been waging a personal vendetta against CoreOS for some time now, and his comments should be understood in that context.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact