Take it from someone who has been involved in both highly durable local filesystems and highly available object storage systems: this is such a hard, nasty problem with so many dark, hidden and dire failure modes, that it takes years of running in production to get these systems to the level of reliability and operability that the data path demands. Given that (according to the repo, though not the breathless blog entry) its creators "do not recommend its use in production", Torus is -- in the famous words of Wolfgang Pauli -- not even wrong.
Whereas, many companies seem to be doing the opposite in their work on these cloud filesystems. They don't build on proven, already-OSS components that have been battle-tested for a long time. They lack features and wisdom from prior deployments. They duplicate effort. They are also prone to using popular components, languages, whatever with implicitly-higher risk due to fact they were never built for fault-tolerant systems. Actually, many of them often assume something will watch and help them out in time of failure.
If it's high-assurance or fault-tolerance, my manta is "tried and true beats novel and new." Just repurpose what's known to work while improving its capabilities and code quality. Knocks out risks you know about plus others you don't since they were never documents. Has that been your experience, too? Should be a maxim in IT given how often problem plays out.
The HPC world can be thought of as an alternative universe, one where dinosaurs came to some sort of lazy sentience instead us mammals. HPC RAS (reliability, availability, serviceability) requirements and demands are profoundly different and profoundly lower from what enterprise or even SMB (small & medium sized businesses) would find acceptable. There are or were interesting ideas in GPFS but that was long ago, far away, and sterile besides - not that it stops IBM squeezing revenue from locked-in customers.
Some people in HPC do realize their insular state, but mostly there are hopes for things to be done better that were done better earlier (e.g. "declustered RAID") in the "real" world. Their ongoing hype is for yet-another-meta-filesystem built with feet of straw upon a pile of manure.
The likes of Google infrastructure, are built instead on academic storage R&D, which has been a great source of ideas and interesting test implementations - log structured file systems, distributed hash tables, etc. There is some small overlap, for example in the work of Garth Gibson, but it's not a joint evolution at all.
They developed in parallel and separately but there's lots of similarities. Many startups in HPC were taking HPC stuff to meet cloud-style requirements for years in terms of management, reliability, and performance. That it wasn't up to the standards of businesses is countered by how many bought those NUMA machines and clusters for mission-critical apps. A huge chunk of work in Windows and Linux in 90's and early 2000's went to trying to match their metrics in order to replace them on the cheap with x86 servers and Ethernet. So, yes, HPC was making stuff to meet tough, business needs that commercial servers and regular stacks couldn't compare to. That's despite number-crunching supercomputers being their main market.
Supercomputer system MTBF numbers aren't high at all. Supercomputer storage is fault tolerant only by fortuitous accident and often requires week-long outages with destroyed data for software upgrades. These systems are built with the likes of dubious metafilesystems (I'm talking to you, Lustre) sitting on top of shaky file systems on top of junk-bin HW RAID systems almost guaranteed to corrupt your data over time.
I think your overall statement is valuable with the global search/replace of "enterprise" for "HPC" - airline reservations or credit card processing. That's all I'm saying - HPC people (at least currently) are massively overrated as developers and as admins. Maybe it's all just a plan to extract energy from Seymour Cray's pulsar-like rotating remains?
Have you tried Ceph or Sector/Sphere? Lustre is known to be crap while Ceph gets a lot of praise and Sector/Sphere has potential for reliability with good performance. I think you may just be stuck with tools that suck. I'll admit it was about 10+ years ago when I was knowledgeable of this area. It could've all gone downhill since.
"That's all I'm saying - HPC people (at least currently) are massively overrated as developers and as admins. "
I'll agree with that. It's one of reasons field built so much tooling to make up for it. ;)
(I'm curious because I run an HPC installation with Lustre and NFS over XFS, and trying to think of the future. MBTF doesn't matter as much as raw speed while it actually runs.)
Lustre, as truly awful as it is, is a POSIX filesystem (or close enough for (literally) government work).
Redhat/Ceph only at the end of April, announced that POSIX functionality was ready for production. Personally, that's not when I'd choose to deploy production storage. Ceph object and nominally block have much more time in production.
If you need POSIX, trusting Ceph at this point is an issue unless, as you say, MTBF isn't a concern. You might want to try BeeGFS, a similar logical model but much simpler to implement, performance up to a very high level, and a record of reliable HPC deployments (as oxymoronic as that sounds).
If you can do with object then certainly exorcise Lustre from your environment in favor of Ceph (or try Scality if non-OS isn't an issue). Lustre's only useful as a jobs program anyway - keeping people occupied who'd otherwise be bodging up real software.
As someone who completely shares your Lustre sentiment, I can't fathom why Intel keeps pouring resources into it.
The problem (surprise!) is IBM. It's mature software, which means 21st Century Desperate IBM sees it as a cash cow - aggressively squeezing customers - and as something they can let their senior, expensive developers move on from - or lay them off in favor of "rightsourcing". You can certainly trust your data to it (unlike Lustre), but it'll be very expensive, especially on an ongoing basis, and the support team isn't going to know more than you by then. Also expect surprise visits from IBM licensing ninja squads looking for violations of the complex terms, which they will find.
As for Lustre, it brings to mind Oliver Wendell Holmes, Jr's, "Three generations of imbeciles are enough". I've been at least peripherally involved with it since 1999, with LLNL trying to strong-arm storage vendors into support. Someone should write a book following 16 years of the tangled Lustre trail from LLNL/CMU/CFS -> Sun -> Oracle -> WhamCloud -> OpenSFS -> ClusterStor -> Xyratex -> Seagate -> Intel (and probably ISIS too).
The answer to your question IHMO, is that Intel just isn't that smart. They're basically a PR firm with a good fab in the basement. What do they know about storage or so many other things? People don't remember when they tried to corner the web serving market back during the 1st Internet boom. They fail a lot, but until now had enough of a cash torrent coming that it didn't matter. They still do, of course, but there are inklings of an ebb.
Typical IBM, though. (shakes head)
Also, look at Sector/Sphere which was made by UDT designer for distributed, parallel workloads for supercomputing. It has significant advantages over Hadoop. It's used with high-performance links to share data between supercomputing centers.
That said, IPoIB should work just fine, and the main bottleneck currently is (Ethernet) latency. I'm running a couple of 1,6PB clusters (432 * 4TB) and can only get 20-60MBps on a single client with a 4kB block size, but got bored of benchmarking after saturating 5 concurrent 10Gb clients with a 4MB block size.
I do expect the RDMA situation to improve substantially over the next year or so, even if authentication will still be unsupported. The latter generally isn't a problem in HPC where stuff like GPFS lives (where you also have to trust every client). And they clearly want that market now that CephFS is finally deemed production ready.
It works with RDMA, has a POSIX layer, and is roughly equivalent to Lustre in performance in my tests, but 1° is very easy to setup (compared to Lustre) 2° has NFS actually working.
The second time I learned the painful lessons of storage was with Manta. Here again, we built on ZFS and reliable, proven "tried and true" technologies like PostgreSQL and Zookeeper. And here again, we learned about really nasty, surprising failure modes at the margins. These failure modes haven't led to data loss -- but when someone's data is unavailable, that is of little solace. In this regard, the data path -- the world of persistent state -- is a different world in terms of expectations for quality. That most of our domain thinks in terms of stateless apps is probably a good thing: state is a very hard thing to get right, and, in your words, tried and true absolutely beats novel and new. All of this is what makes Torus's ignorance of what comes before it so exasperating; one gets the sense that if they understood how thorny this problem actually is, they would be trying much harder to use the proven open source components out there rather than attempt to sloppily (if cheerfully) reinvent them.
"During the event, one of the shard databases had all queries on our primary table blocked by a three-way interaction between the data path queries that wanted shared locks, a "transaction wraparound" autovacuum that held a shared lock and ran for several hours, and an errant query that wanted an exclusive table lock."
That's with well-documented, well-debugged components doing the kinds of things they're expected to do. Still downed by a series of just three interactions creating a corner case. Three out of a probably ridiculous number over a large amount of time. Any system redoing and debugging components plus dealing with these interaction issues will fare far worse. Hence, both of our recommendations to avoid that risk.
Note: Amazon's TLA+ reports said they model-checkers for finding bugs that didn't show up until 30+ steps in the protocols. An unlikely set of steps that actually was likely in production per logs. Reading such things, I have no hope that code review or unit tests will save my ass or my stack if I try to clean-slate Google or Amazon infrastructure. Not even gonna try haha.
For those unfamiliar with the reference, there was an eye-opening report from Amazon engineers who'd used formal methods to find bugs in the design of S3 and other systems several years ago . I highly recommend reading it and then watching as many of Leslie Lamports talks on TLA+ and system specifications as possible.
Postgres doesn't have many weaknesses, but most of them relate to autovacuum.
Well, there's knowing it's autovacuum-related then there's the specific way it's causing a failure. First part was obvious. The rest took work.
"Postgres doesn't have many weaknesses, but most of them relate to autovacuum."
Sounds like that statement should be on a bug submission or something. They probably need to replace that with something better.
This creates big difficulties for both consistency and performance, and fixes for consistency make the performance worse (and vice versa).
Google's filesystem could use reed-solomon because they're append-only, making consistency a non-issue and performance can be fixed by buffering on the client side.
Secondly, this implies you are remapping the LBA (offsets) all the time, perhaps taking what would be sequential access and turning it into random? That sounds pretty painful.
Further more, it appears to me that etcd is now in the "data path" That is, in theory, each access could end up hitting etcd.
If so, I really would question why anyone would do this at all... this is not how any storage system is written.
If only one host at a time has access to a given virtual block device, there are some opportunities to buffer outgoing writes with a write-through cache. That might be the way to go if you explore erasure coding in the future.
Until now, I've thought quite highly of the CoreOS team. Now, not so much. They're playing a "freeze the competition" game instead of trying to compete on merit.
Really you come off as defensive. I don't understand how CoreOS has disrespected anyone by coming up with their own approach to the problem.
They're not writing blog posts or comments on HN, they're writing code.
Actually, the problem is that 90% of the code still remains to be written, while others (including me) have already done so. They've addressed only the very simplest part of the problem, not even far enough to show any performance comparisons, in a manner strongly reminiscent of Sheepdog (belying your "own approach" claim). That's a poor basis from which to promise so much. It's like writing an interpreter for a simple programming language and claiming it'll be a full optimizing C++ compiler soon. Just a few little pieces remaining, right?
It's perfectly fine for them to start their own project and have high hopes for it. The more the merrier. However, I have little patience for people who blur the lines between what's there and what might hypothetically exist some time in the future. That's far too often used to stifle real innovation that's occurring elsewhere. Maybe it's more common in storage than whatever your specialty is, but it's a well known part of the playbook. It's important to be crystal clear about what's real vs. what's seriously thought out vs. what's total blue-sky. Users and fellow developers deserve nothing less.
torusctl -C $ETCD_IP:2379 init
./torusd --etcd 127.0.0.1:2379 --peer-address http://127.0.0.1:40000 --data-dir /tmp/torus1 --size 20GiB
> gluster peer probe ...
> gluster volume create ...
> gluster volume start ...
qemu-img create -f qcow2 f.img 1G
modprobe nbd || true
qemu-nbd -c /dev/nbd0 f.img
mount /dev/nbd0 k
killall -KILL qemu-nbd
qemu-img create -f qcow2 f.img 1G
modprobe nbd || true
qemu-nbd -c /dev/nbd0 f.img
mount /dev/nbd0 k
killall -KILL qemu-nbd
 Longer answer: https://news.ycombinator.com/item?id=2287033
Sun should be resurrected now given that risc is leading the way and build everything on top of ARM! :). if only.
I was thinking mostly of the eternal buzzkill that ZFS is only usable through indirect means.
a) Build a distributed OS
b) Build a distributed scheduler (Fleet)
c) Build a distributed key value system (etcd)
d) Build a new container engine (Rocket)
e) Build a network fabric (Flannel)
f) Now embark on building a brand new distributed storage system.
Holy cow that's some goal list. Sounds like something my kids would make up for their Christmas wish list.
I really don't get how their board and investors let them get away with such a childish imagination.
Each one of those is a company effort on its own.
However CAS and immutable file systems aren't that common.
It's not difficult, it's actually impossible. And any system relying on the reliability of that would be broken by design.
> let alone string that into a distributed system that must make CAP tradeoffs
No. Distributed systems and "CAP tradeoffs" actually exist to solve the problem above with predictable certainty.
It is impossible to reliably store something in your local storage. It is possible to achieve some probability of data retention and availability in a distributed system.
An open source OS company just blogged about a new open source project that they are putting resources into. They did not release a commercial product that competes with any existing storage solution.
How exactly would one expect a new project to be announced? Tell me more about how far away from done you think they are.
Sorry if that's a bit sarcastic, but seriously wouldn't it be nice if these comments where more along the lines of: "Interesting new project, here's some issues we've run into vis a vis reliable storage that you should look out for..."
Plus all the arguments about local storage failure being hard are moot because this is a distributed solution and the hardship of local storage reliability can be abstracted away almost completely.
For me, this is surely systems work.
Some things that need to be solved sooner or later: data replication so that N faults of X entities are protected against (X can be disks, enclosures, racks, data centers, regions, ..), recovery from failed disks, scrubbing, data management, backups, some kind of storage orchestration and centralized management.
If you look at Ceph which IMHO represents the state of the art in software defined storage, it took many years to get it to a point that it was usable. I hate to be cynical but in this case I would be surprised if CoreOS can pull this of. I wish them all the best though and would be happy for them if I am wrong.
Which is why I'm rather surprised that these systems continue to be released without any formal specification or model checking. It's hard to get these things right... worth the effort, IMO, to create formal specifications.
I think the post outlines quite well why they chose to build a new distributed storage layer. From a product perspective it makes a lot of sense for them to have an offering in this space. Also, the fact that it's a hard problem is what makes it a compelling product offering. You can't make much money solving easy problems.
If they can solve the hard problem, yes. Which I believe is unlikely.
Persistent storage is a "hard problem" Really?
Don't we have distributed data storages precisely because it's impossible to guaranty persistence locally? It's kind of a way to not bother trying to solve the impossible, but to achieve some guarantees on a different level.
They're aiming to do a lot more than that. ;)
When you say Ceph is "state of the art in software defined storage", are you including performance, or are there advantages in features or reliability where you think it outclasses the competition?
tl;dr this is a new OSS distributed storage project that is written in Go and backed by etcd for consistency. The first project built on top of Torus is a network block device that can be mounted into containers for persistent storage. It also includes integrations out of the box for Kubernetes "flex volumes". Get involved! :)
Mounting containers to a distributed block system is an anti-pattern. This is going to go poorly.
Ceph has had some really smart people working on distributed block for a lot of years, and they still have significant issues. It's not because they're dumb, it's because performant, scalable, and available distributed block is either hard or impossible.
Am I understanding correctly that this is a file-based API? Distributing a POSIX filesystem effectively is very challenging, particularly since most applications that use them aren't written with CAP in mind; they don't expect a lot of basic operations to block for extended periods and fail in surprising ways, and they very often perform poorly when operations that are locally quick end up much slower over a network.
To be concrete:
> Today’s Torus release includes manifests using this feature to demonstrate running the PostgreSQL database server atop Kubernetes flex volumes, backed by Torus storage.
It will be interesting to see how well this performs and how it behaves in the face of single-node failures and network congestion.
Backed by etcd. Today etcd is a well tested and widely used consistent store with users like Kubernetes, Flannel (now Canal), Fleet, SkyDNS and many others. Building something like etcd requires tons of testing and etcd is becoming the solid go to for this category of distributed problems.
Easy to work on code base. Building an OSS community around a complex technology is really hard. It shouldn't be underestimated that a code base that can be built on OS X, Linux, and Windows will have an advantage in getting people involved and maturing the project.
Designed for other storage application. Torus is backed by a distributed storage interface that is exposed via gRPC (with the opportunity for language bindings in lots of languages). Our hope is to let people build write-ahead log systems, object systems, or filesystems on top of this abstraction.
Flexibility built into the storage layer. Over time we want people to implement different hash maps or explicit configuration of storage layout. This is an abstraction that we have today and intend to build out over time with feedback from users.
Happy to answer more questions. This is a young project and we look forward to working with people and hearing about what they want out of a storage system. Overall, we want to build something with folks that is easy to work on and backs itself with technologies that are getting good operational understanding in the "Cloud Native" space.
To me at least, having it "backed by etcd" is a big red flag, not a feature.
So, this company is trading away stuff like that for custom components and Etcd? And for a component focusing on integrity and availability? Huh?
I've managed to lose data with Etcd
Being built atop something else doesn't mean it inherits all the stability of the underlying system; it's an upper bound not a lower bound.
Thinking about it further, I think I can guess at what's going on here. The key observation is that there's no sync() at the block device level. It's a filesystem operation; block devices don't see it. Sure, there are queue flags and FUA and such, but those are different (and I'm not sure any of those exist in NBD). Where is this sync() path? I'm guessing it's internal on the data servers, to deal with data that's being buffered there. With both replication and erasure coding, correct recovery requires exact knowledge of what has been fully written where, and that's the kind of metadata I suspect is being put in etcd. There's not even necessarily anything wrong with it, unless updating that information only on sync() means that supposedly durable writes since the last (unpredictable to the client) sync could be lost on failure. I hope that's not the case.
Maybe I'll find time, in the midst of my work on an actual production-level distributed filesystem, to look at the code and see if my guess is correct.
> All writes up until a sync are effectively anonymous until a sync().
"Anonymous" means nothing in this context. Do you mean non-durable? First, please use the correct term. Second, if that is what you mean then you're probably doing it wrong. File writes are allowed to be asynchronous (unless O_SYNC and friends). Block device writes are expected to be synchronous, or at least to preserve order. This is exactly the kind of thing that needs to be thoroughly thought out before code is even written, and that thinking should be spelled out somewhere for people to help make sure all the nasty corner cases are covered. Your cart is way before your horse.
For Etcd I'm of two minds. I like its simplicity. On the other hand, I have had far more problems with Etcd than I've ever had with e.g. Consul (though I dislike Consuls "and the kitchen sink" approach), up to and including data loss and more than once having to re-initialise clusters. I'm a couple of years of no further problems at least a way from trusting Etcd with my data.
For this? They have to have very compelling stories for why they couldn't e.g. modify Ceph or Gluster to achieve whatever it is that they want. And it's no clear what they want. Gluster for example is built as a stack of pluggable translators that is relatively easy to extend.
Even if they have a credible and compelling story for why we need something else, it will take years of production use by others before I'd trust any data to a new distributed file system.
Gluster and Ceph have those years. I've had a Gluster volume running without maintenance or data loss for more than five years. Even then it took several years before I truly started to trust it.
Gluster had many issues too when it was a young project, but they were fixed many years ago. I don't doubt that Etcd too will become solid enough for me to trust, but looking at part of the docs  we still have "gems" such as "Permanent Loss of Quorum Requires New Cluster" (you don't lose your data to that one, but it's painful nevertheless).
And yes, I've run into that one. Not "permanent" as in "we could never recover the machines in question", but "permanent" as in "we're not going to be able to get this sorted fast enough, so let's fail everything over". Everything else we run can handle that scenario easily. We know it means consistency issues, but if you run into that situation we have taken the decision that they are acceptable and will be resolved later if necessary. The last thing I need in a scenario like that is for some critical component to just refuse to run without requiring lots of manual intervention.
All we'd like would be to be able to say "ok, drop all the other members and pretend all is well, and yes we know what we're asking". There are a number of gotcha's like that with Etcd that sounds find until they bite you in the ass.
They'll resolve these things eventually, but storage is the very last place I'd be willing to take risks with it. So until they have years of track record of running flawlessly, with less hassle than Gluster or Ceph, and substantial advantages, I'll see it as the dangerous choice.
How many separate data centres do you want me to host in? Two is certainly insufficient, as if the data centre with the majority of Etcd nodes now falls of the network, you're SOL.
Three? Think you can't have two fall of face of the earth the first time? Been there. Redundant connections doesn't always help if there are major external connectivity issues and you don't have multi-million dollar budgets for connectivity alone.
Most small to mid size companies do not have the budget to run a system that is sufficiently redundant to be able to guarantee that they won't sooner or later be in a situation where either 1) the most viable partition is the minority, or 2) the Etcd cluster is centralised enough that too large parts of the cluster can loose access to it. When fighting large scale failures is the last time where you want to have to fight your tools to be able to convince them that you know what you're doing.
For those kind of situations, Etcd's lack of support for more fine-grained control of consistency or for intentionally breaking apart the cluster and letting them continue to operate independently when you know what you're doing basically makes it unsuitable for running multi-data centre clusters for anything important.
Which basically means any tools that assumes a single underlying Etcd cluster is appropriate as storage, for, say, a cluster scheduler (cough) becomes an inappropriate tool for those kind of setups.
And it generally seems that, when they say "composable modular tools", one should read "includes dependencies on the rest of our stack". This isn't intended as harshness. It is just that every time a tool from COS seems interesting enough to look in to, it turns out that I wouldn't be investing in a tool, it would be investing in an ecosystem that duplicates a ton of what I'm already managing.
I write this not as storage religion (of which there's far too much), but to warn away those who haven't experienced the many kinds of data (and stomach lining) loss that come with being a Lustre admin.
I swore I added it to the Wikipedia links. I'll have to do it again, I guess.
(disclaimer: I'm an engineer there...)
I thought Elasticsearch was open-sourced, and the GitHub repo says Apache v2. What am I missing here?
* Single reader/writer. If you try and set up more, Bad Things happen (even if it's single-writer, you don't get read-after-write consistency).
* It sure sounds like network partitions will allow all kinds of badness.
* If copy 1 goes down, you can keep operating on copy 2, then lose copy 2, have copy 1 come back up, and then warp back in time? Maybe that's prevented because of append-only and you just lose the data entirely because it was only replicating to one node due to the "temporary" failure.
Most egregiously, the Architecture description of an Inode implies that persisting a write-to-disk requires persisting the "INode". INodes are persisted in etcd.
Which means your entire cluster's write-to-disk throughput is limited to what you can push through a Raft consensus algorithm.
Look, there are all kinds of reasons one could legitimately decide that none of the existing scalable block storage systems satisfy your use case. Maybe containers really are different enough from VMs. But the blog post claims that there just aren't any solutions; the research papers cited in the Documentation page are mostly old and are about systems in a very different part of this sub-space; and what developer documentation exists does not encourage me that this is a good idea.
Granted, I've been working on Ceph for 7 years and am a bit of a snob as a result.
Good work on Ceph, by the way. I've been following your work since it was a PhD if I remember correctly.
One example: Both Gluter and Ceph have erasure-coded storage. Gluster's looks just like the replicated storage, only it involves more nodes and less overhead. Ceph's is severely limited in comparison: it's append-only, it doesn't allow use of Ceph's omap kv store, "object class" embedded code, etc. The reason is because distributed EC is subject to the same kind of problem as the RAID5 write hole: if a Gluster client submits an overwrite to 3 of a 4+2 replica group and then crashes, the overwritten data is unrecoverable and the newly-written data never made it.
Torus won't hit that particular issue because it is log-structured to begin with, which has all kinds of advantages. But garbage collection is really hard! Much harder than seems remotely reasonable! Getting good coalescing and read performance is really hard! Much harder than seems remotely reasonable! There's one big existing storage log-structured distributed storage system which has discussed this publicly: Microsoft Azure. They have a few papers out which hint at the contortions they went through to make block devices work performantly — and Azure writes first to 3-replica and then destages to the log! They still had performance issues!
https://github.com/coreos/torus/blob/master/Documentation/re... points to a bunch of HDFS research and replacements; HDFS is designed for the opposite (large files with high bandwidth and nobody-cares latency) of what I presume Torus is targeting (high IO efficiency, low latency). Mostly the same for the Google papers they cite. There's no mention of Azure's storage system papers, nor of Ceph, nor anything about the not-paper-publishing-but-blogging stuff from Gluster or sheepdog; nor from academic research into VM storage systems (there's tons about this!).
Can they fix a bunch of this? Sure. But the desires they list in eg https://github.com/coreos/torus/blob/master/Documentation/ar... go towards making things worse, not better. They aren't talking about how etcd can be in the allocator path but not the persistence path and how every mount should run a repair on the data to deal with out-of-date headers. They talk about adding in filesystems, but not any way of supporting read-after-write (which is impossible with the primitives they describe so far, and really hard in a log-structured system without synchronous communication of some kind). They discuss network partitions between the storage nodes, and between the client and etcd; they don't discuss clients keeping access to the disks but losing it to etcd.
 Using etcd for allocation would be a reasonable choice, but putting it in the persistence path is now. Right now a database in your container would require two separate write streams to do an fsync:
1) data write. It doesn't say in docs and I didn't look to see if replication is client or server-driven, but assuming sanity the network traffic is client->server1->server2->server1->client, with a write-to-disk happening before the server2->server1 step.
2) etcd write. Client->etcd master->etcd slaves (2+)->etcd master->client, with a disk write to each etcd process' disk before the reply to etcd master.
This is a busy-neighbor, long-tail latency disaster waiting to happen.
>haven't figured out consistency yet
I don't recall this being the case, but I'm not the authority on the matter.
>I presume Torus is targeting (high IO efficiency, low latency)
The best-possible performing storage solution is _definitely_ not our primary goal, though we'll take it where we can get it. The most important goals for the project are ease of use, ease of management, flexibility, and correctness when the use-case desires it. Please note that the block device interface is only one of many planned. The underlying abstraction was designed (and will be improved) to support other situations.
>Using etcd for allocation would be a reasonable choice, but putting it in the persistence path is now. Right now a database in your container would require two separate write streams to do an fsync
In Torus today (with the block device interface specifically), and with the caveat that I'm not the authority, so I may be slightly wrong, calling sync(), fsync(), and friends result in what I think you would consider an "allocation". Writes happen against a snapshot of the file (in this block storage case, the block volume is the "file"), and then a sync() makes those changes visible as the "current" version. Syncs hit etcd, writes do not.
I would really encourage you to submit feedback like this in GitHub issues. The project is still in _very_ early stages, and legitimate feedback can actually make a difference.
I'd be happy to work on a specification with the community.
This is one of those classic "release too early or release too late" sort of things that got cut in preference to get early feedback and community participation.
The initial goals that are in this release:
1. An abstraction for replicated storage between machines in a cluster.
2. A block device "application" on top that an ext4 filesystem could be run on. This is like an EBS.
In the future someone might build other applications like "object storage" or filesystems and we would love to get the feedback on the API to do that. But, that isn't in the initial goals and we will be focusing on the storage and block layers for the time being.
But when it comes to container storage I'm confused.
The 12 factor app methodologies work very well for every service I have written and supported.
Processes/containers get ephemeral storage. When you need persistence you delegate to a stateful service, generally Postgres and S3.
Id much rather write my apps around these simple and well understood constraints then depend on magic file systems.
However, when your running in your own datacenter, you don't have the luxury of using Amazon's hosted products, so Torus exists to help provide some building blocks to build your own S3, or potentially even RDS.
All other solutions are not practical. GCE and EBS are only single mount. iSCSI is unsupported in the cloud. Leaving only Ceph and Glusterfs, both mentioned here, but needing heavy configuration.
On top of that, there's work going on to get Ceph easily deployable on Kubernetes.
Better bet for destiny: Google is letting them build stuff off the books by just funding them, and at some point they'll get quietly bought with all their stuff added to Kubernetes, and that'll be that.
I found SXCluster and posted it here https://news.ycombinator.com/item?id=11812566 - not directly geared towards containers, but very easy to setup (no need for etcd and the like).
Don't worry about it. Some guys just mistake their experience with traditional storage systems as meaningful here.
On topic, yes. It is quite trivial to lose data with etcd and pretty much everybody I know who runs it has experienced problems. Try backing it up and restoring. etcd is compsci great yet operationally burdensome; it is extremely difficult to operate, tends to make assumptions and explain to you how to operate your fleet, and the developers are not very receptive to these and other operational concerns. This stems from CoreOS culture, to be quite clear -- CoreOS seeks to eliminate operations as a discipline and replace it with software, and tends to disregard or devalue operational concerns as a result.
Launching down the storage path with Go software and disregarding several decades of operational experience in the industry on how to do distributed storage correctly (not to mention the "looking down" on etcd-related infrastructure decisions from this anonymous employee) should indicate to you that I'm not making this up, despite how it may appear.
90% of folks I know on etcd, I'd estimate, have (a) reverted to Zookeeper or (b) moved on to Consul. It is the single piece of software that is holding Kubernetes back, too, and it's fairly obvious from roadmap direction that Kubernetes is now the etcd customer. Plan accordingly.
You can't comment like this here. It's a breach of civility and took the conversation in a needlessly personal direction. Please don't do this again.
We detached this subthread from https://news.ycombinator.com/item?id=11818589 and marked it off-topic.
I think this whole thread is a bit harsh and unfair to CoreOs and their solutions.
But I also have spent enough time using it to run across certain warts, and Etcd clusters refusing to start without manual intervention etc. have been a frequent enough problem that when they go after something incredibly hard such as distributed storage, and it's relying on Etcd, I see that as a scary combination.
CockroachDB is built on top of etcd/raft package, not etcd itself.
Mostly I think this is a matter of people naturally standing up for their friends and colleagues, which is a wonderful thing, vs. people who have specific concerns about The Right Way to do either technical or non-technical things. No coordination or collusion is necessary. You can see exactly the same thing happen for every company or project that's discussed here, or on Twitter, or wherever. Try criticizing a YC portfolio company some time. Not all of the people arguing with you will admit their affiliations, and of course the downvotes are all anonymous anyway. Just remember, the more skin someone has in the game, the more tempted they'll be to cross that vague line into astroturf. All the rest of us can do is admit our affiliations and biases, and hope that people will get their heads out of the ad hominem gutter enough to reach rational conclusions about the facts being presented.