
Torus: A distributed storage system by CoreOS - philips
https://coreos.com/blog/torus-distributed-storage-by-coreos.html
======
bcantrill
It's hard to take this seriously: storage is an excruciatingly hard problem,
yet this cheerful description of a nascent and aspirational effort seems
blissfully unaware of how difficult it is to even just reliably get bits to
and from stable storage, let alone string that into a distributed system that
must make CAP tradeoffs. There is not so much as a whisper as to what the data
path actually looks like, other than "the design includes the ability to
support [...] Reed-Solomon error correction in the near future" -- and the
fact that such an empty system hails itself as pioneering an unsolved problem
in storage is galling in its ignorance of prior work (much of it open source).

Take it from someone who has been involved in both highly durable local
filesystems[1] and highly available object storage systems[2][3]: this is such
a hard, nasty problem with so many dark, hidden and dire failure modes, that
it takes _years_ of running _in production_ to get these systems to the level
of reliability and operability that the data path demands. Given that
(according to the repo, though not the breathless blog entry) its creators "do
not recommend its use in production", Torus is -- in the famous words of
Wolfgang Pauli -- not even wrong.

[1] [http://dtrace.org/blogs/bmc/2008/11/10/fishworks-now-it-
can-...](http://dtrace.org/blogs/bmc/2008/11/10/fishworks-now-it-can-be-told/)

[2] [http://dtrace.org/blogs/bmc/2013/06/25/manta-from-
revelation...](http://dtrace.org/blogs/bmc/2013/06/25/manta-from-revelation-
to-product/)

[3] [http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-
ma...](http://dtrace.org/blogs/dap/2013/07/03/fault-tolerance-in-manta/)

~~~
nickpsecurity
I totally agree with you. I also liked how they said the motivation is to make
Google infrastructure for everyone else. How did Google do this? They
basically imitated and improved on clustered filesystems developed in HPC.
There were a lot of lessons for the emerging cloud market in all the tooling
done in HPC. Some of it was FOSS.

Whereas, many companies seem to be doing the opposite in their work on these
cloud filesystems. They don't build on proven, already-OSS components that
have been battle-tested _for a long time_. They lack features and wisdom from
prior deployments. They duplicate effort. They are also prone to using popular
components, languages, whatever, with implicitly higher risk due to the fact
that they were never built for fault-tolerant systems. Actually, many of them
often assume something will watch and help them out in time of failure.

If it's high-assurance or fault-tolerance, my mantra is "tried and true beats
novel and new." Just repurpose what's known to work while improving its
capabilities and code quality. That knocks out the risks you know about, plus
others you don't, since they were never documented. Has that been your
experience, too? It should be a maxim in IT given how often this problem plays
out.

~~~
bcantrill
Yes, that's absolutely been my experience -- and even then, when it comes to
the data path, you will likely find new failure modes in "tried and true" as
you push it harder and longer and with the bar being set at absolute
perfection. I have learned this painful lesson twice: first, with Fishworks at
Sun when we turned ZFS into a storage appliance -- and we learned the painful
difference between something that seems to work all of the time and something
that _actually_ works all of the time. (2009 was a really tough year.[1]) And
ZFS was fundamentally sound (and certainly had been running for years in
production) before we pushed it into the broad enterprise storage substrate:
the bugs that we found weren't ones of durability, but rather of deeply
pathological performance. (As I was fond of saying at the time, we never lost
anyone's data -- but we had some data take some very, very long vacations.) I
shudder to think about building a data path on much less proven components
than ZFS circa 2008, let alone building a data path seemingly in total
ignorance of the mechanics -- let alone challenges -- of writing to persistent
storage.

The second time I learned the painful lessons of storage was with Manta.[2]
Here again, we built on ZFS and reliable, proven "tried and true" technologies
like PostgreSQL and Zookeeper. And here again, we learned about really nasty,
surprising failure modes at the margins.[3] These failure modes haven't led to
data loss -- but when someone's data is unavailable, that is of little solace.
In this regard, the data path -- the world of persistent state -- is a
different world in terms of expectations for quality. That most of our domain
thinks in terms of stateless apps is probably a good thing: state is a very
hard thing to get right, and, in your words, tried and true absolutely beats
novel and new. All of this is what makes Torus's ignorance of what comes
before it so exasperating; one gets the sense that if they understood how
thorny this problem actually is, they would be trying much harder to use the
proven open source components out there rather than attempt to sloppily (if
cheerfully) reinvent them.

[1] [http://dtrace.org/blogs/bmc/2010/03/10/turning-the-
corner/](http://dtrace.org/blogs/bmc/2010/03/10/turning-the-corner/)

[2] [https://github.com/joyent/manta](https://github.com/joyent/manta)

[3] [https://www.joyent.com/blog/manta-
postmortem-7-27-2015](https://www.joyent.com/blog/manta-postmortem-7-27-2015)

~~~
nickpsecurity
That was a pretty humble and good read. I don't think I'd have seen the
autovacuuming issue coming. Actually, this quote is a perfect example of how
subtle and ridiculous these issues can be:

"During the event, one of the shard databases had all queries on our primary
table blocked by a three-way interaction between the data path queries that
wanted shared locks, a "transaction wraparound" autovacuum that held a shared
lock and ran for several hours, and an errant query that wanted an exclusive
table lock."

That's with well-documented, well-debugged components doing the kinds of
things they're expected to do. Still downed by a series of just three
interactions creating a corner case. Three out of a probably ridiculous number
over a large amount of time. Any system redoing and debugging components plus
dealing with these interaction issues will fare far worse. Hence, both of our
recommendations to avoid that risk.
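To see just how subtle it is, here's a toy model of a FIFO lock queue. This is a simplification for illustration, not Postgres's actual lock manager: the key behavior is that a _waiting_ exclusive request blocks every later shared request, even while only a shared lock is held.

```python
# Toy model of a fair (FIFO) lock queue -- an illustration of the
# three-way interaction in the quote, not Postgres internals.

SHARED, EXCLUSIVE = "shared", "exclusive"

def grantable(request, held, queue):
    """A request is granted only if it conflicts with nothing held
    and nothing queued ahead of it (FIFO fairness)."""
    if request == EXCLUSIVE:
        return not held and not queue
    # shared: conflicts with a held exclusive or any queued exclusive
    return EXCLUSIVE not in held and EXCLUSIVE not in queue

held = [SHARED]            # autovacuum holds a shared lock for hours
queue = []

# errant query wants an exclusive table lock: it must wait
assert not grantable(EXCLUSIVE, held, queue)
queue.append(EXCLUSIVE)

# every subsequent data-path query now queues behind the waiting
# exclusive request, even though it only wants a shared lock
assert not grantable(SHARED, held, queue)
```

Three individually reasonable behaviors, and the whole table is blocked.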

Note: Amazon's TLA+ reports said their model checkers found bugs that
didn't show up until 30+ steps into the protocols -- an unlikely set of steps
that actually was likely in production, per the logs. Reading such things, I have no
hope that code review or unit tests will save my ass or my stack if I try to
clean-slate Google or Amazon infrastructure. Not even gonna try haha.

~~~
nl
Anyone who's worked with Postgres at scale would guess autovacuum.

Postgres doesn't have many weaknesses, but most of them relate to autovacuum.

~~~
nickpsecurity
"Anyone who's worked with Postgres at scale would guess autovacuum."

Well, there's knowing it's autovacuum-related then there's the specific way
it's causing a failure. First part was obvious. The rest took work.

"Postgres doesn't have many weaknesses, but most of them relate to
autovacuum."

Sounds like that statement should be on a bug submission or something. They
probably need to replace that with something better.

~~~
anarazel
It's known to the postgres developers, and we are working on it. This specific
issue (anti-wraparound vacuums being a lot more expensive) should be fixed in
the upcoming 9.6.

~~~
nickpsecurity
Awesome! I already push Postgres and praise its team for the quality focus.
Just extra evidence in your favor. :)

------
joshuak
Good... Good... Let the hate flow through you.

An open source OS company just _blogged_ about a new open source project that
they are putting resources into. They did not release a commercial product
that competes with any existing storage solution.

How exactly would one expect a _new_ project to be announced? Tell me more
about how far away from done you think they are.

Sorry if that's a bit sarcastic, but seriously, wouldn't it be nice if these
comments were more along the lines of: "Interesting new project, here's some
issues we've run into vis-a-vis reliable storage that you should look out
for..."

~~~
abrookewood
Yes, I have to agree. There's a little too much negativity for my liking,
especially from people who have a vested interest in downplaying the efforts
of others.

------
sysexit
I love CoreOS and they've done some super impressive engineering. But really,
a new storage system? Rewriting in Go and using etcd for central state
management makes things easier, but this is still a _hard_ problem.

Some things that need to be solved sooner or later: data replication so that N
faults of X entities are protected against (X can be disks, enclosures, racks,
data centers, regions, ...), recovery from failed disks, scrubbing, data
management, backups, some kind of storage orchestration, and centralized
management.
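To illustrate the first of those requirements (surviving faults of whole failure domains), here is a generic placement sketch that never puts two replicas in the same rack. This is an illustration of the constraint, not how Torus or any of these systems actually place data.

```python
# Sketch: choose replica targets so that no two copies share a
# failure domain (here, racks). Purely illustrative.

def place(replicas, disks_by_rack):
    """Pick one disk from each rack until `replicas` copies are placed."""
    placement = []
    for rack, disks in disks_by_rack.items():
        if len(placement) == replicas:
            break
        if disks:
            placement.append((rack, disks[0]))
    if len(placement) < replicas:
        raise RuntimeError("not enough failure domains for replica count")
    return placement

disks = {"rack1": ["d1", "d2"], "rack2": ["d3"], "rack3": ["d4"]}
print(place(3, disks))  # one copy per rack: losing any one rack is survivable
```

Real systems (Ceph's CRUSH, Swift's rings) generalize this across arbitrary hierarchies of disks, enclosures, racks, and regions, which is part of why it takes years to get right.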

If you look at Ceph which IMHO represents the state of the art in software
defined storage, it took many years to get it to a point that it was usable. I
hate to be cynical, but in this case I would be surprised if CoreOS can pull
this off. I wish them all the best though, and would be happy for them if I am
wrong.

~~~
ams6110
It's like any time we take a step forward in one area we have to reinvent the
last 50 years of computing to support it.

Persistent storage is a "hard problem"? Really?

~~~
vidarh
Persistent storage isn't a hard problem. Distributed, well performing,
scalable and consistent storage is a hard problem.

~~~
dap
Actually, persistent storage is fairly hard in itself. Look at what ZFS does
to ensure data integrity in the face of phantom writes, dropped writes, bad
controllers, and other implicit, non-fatal failures.

~~~
vidarh
ZFS still has the issue of having to perform well. You have a point, but
ZFS is still trivial compared to a proper distributed filesystem, and you
could achieve the same reliability much more easily than ZFS if you sacrificed
the performance.

~~~
thinkersilver
The ClusterHQ guys behind Flocker found this out the hard way [0].
Initially Flocker was meant to provide a container data migration tool on top
of ZFS; now it is a front-end to more established storage systems like
Cinder, vSAN, EBS, GPD, and so on.

[0] [https://docs.clusterhq.com/en/latest/faq/#what-happened-
to-z...](https://docs.clusterhq.com/en/latest/faq/#what-happened-to-zfs-
support)

------
philips
The CoreOS team is excited to make this initial release and start
collaborating with folks that want to tackle distributed storage.

tl;dr this is a new OSS distributed storage project that is written in Go and
backed by etcd for consistency. The first project built on top of Torus is a
network block device that can be mounted into containers for persistent
storage. It also includes integrations out of the box for Kubernetes "flex
volumes". Get involved! :)

[https://github.com/coreos/torus](https://github.com/coreos/torus)

~~~
paulsutter
What's the current largest-scale deployment?

~~~
lobster_johnson
It's described as a prototype, so I'm pretty sure there's no deployment at
scale at all.

------
alrs
Storing container images is a great use case for an object store, not a block
store. They should get sucked down whole to ephemeral storage on the compute
nodes.

Mounting containers to a distributed block system is an anti-pattern. This is
going to go poorly.

Ceph has had some really smart people working on distributed block for a lot
of years, and they still have significant issues. It's not because they're
dumb, it's because performant, scalable, and available distributed block is
either hard or impossible.

~~~
sagichmal
This is about providing data volumes to containers, not about hosting
container images or mounting containers themselves.

------
dap
> At its core, Torus is a library with an interface that appears as a
> traditional file, allowing for storage manipulation through well-understood
> basic file operations. Coordinated and checkpointed through etcd’s consensus
> process, this distributed file can be exposed to user applications in
> multiple ways. Today, Torus supports exposing this file as block-oriented
> storage via a Network Block Device (NBD). We also expect that in the future
> other storage systems, such as object storage, will be built on top of Torus
> as collections of these distributed files, coordinated by etcd.

Am I understanding correctly that this is a file-based API? Distributing a
POSIX filesystem effectively is very challenging, particularly since most
applications that use them aren't written with CAP in mind; they don't expect
a lot of basic operations to block for extended periods and fail in surprising
ways, and they very often perform poorly when operations that are locally
quick end up much slower over a network.

To be concrete:

> Today’s Torus release includes manifests using this feature to demonstrate
> running the PostgreSQL database server atop Kubernetes flex volumes, backed
> by Torus storage.

It will be interesting to see how well this performs and how it behaves in the
face of single-node failures and network congestion.

~~~
philips
It isn't a POSIX filesystem API. It is like a single file or a big distributed
tape.

~~~
dap
Virtualized, network block devices have all the same problems I described --
even worse, because the abstraction conveys even less about what an application
is trying to do.

~~~
ibotty
No. The difference is that with network block devices you are usually only
allowed to access the block device once. That's an easier problem than mapping
POSIX file system semantics!

------
xfactor973
I don't understand why this is being used over say librados + librbd from
Ceph. This seems like an awful lot of work to get the same functionality that
Gluster/Ceph already have. Is there something I'm missing here?

~~~
bogomipz
I kind of see this as a general trend where CoreOS is concerned: they
reimplement everything in-house, even when an open source solution might
already exist -- rkt, etcd, now this.

~~~
zsmith928
what is the existing alternative for storage in distributed container
environments?

~~~
gsmethells
Ceph, OpenStack Swift, GlusterFS, OrangeFS, Lustre.

[https://en.wikipedia.org/wiki/Clustered_file_system#Distribu...](https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems)

[https://en.wikipedia.org/wiki/Object_storage](https://en.wikipedia.org/wiki/Object_storage)

~~~
pinewurst
If Lustre is the answer, you're probably asking the wrong question. Unless
that question involves short life-span, massively parallel swap-outs from one
weapons simulation to another.

I write this not as storage religion (of which there's far too much), but to
warn away those who haven't experienced the many kinds of data (and stomach
lining) loss that come with being a Lustre admin.

~~~
sitkack
[https://www.csc.fi/web/blog/post/-/blogs/the-largest-
unplann...](https://www.csc.fi/web/blog/post/-/blogs/the-largest-unplanned-
outage-in-years-and-how-we-survived-it)

------
gregsfortytwo
Okay, so from reading the developer docs at
[https://github.com/coreos/torus/Documentation](https://github.com/coreos/torus/Documentation),
I'm inferring that:

* Single reader/writer. If you try and set up more, Bad Things happen (even if it's single-writer, you don't get read-after-write consistency).

* It sure sounds like network partitions will allow all kinds of badness.

* If copy 1 goes down, you can keep operating on copy 2, then lose copy 2, have copy 1 come back up, and then warp back in time? Maybe that's prevented because of append-only and you just lose the data entirely because it was only replicating to one node due to the "temporary" failure.

Most egregiously, the Architecture description of an Inode implies that
persisting a write-to-disk requires persisting the "INode". INodes are
persisted in etcd. Which means your entire cluster's write-to-disk throughput
is limited to what you can push through a Raft consensus algorithm.
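Back-of-envelope, with invented but plausible numbers (these are illustrative assumptions, not measurements of etcd or Torus):

```python
# If every persisted write must also commit a metadata record through
# Raft, the consensus round rate bounds cluster-wide write throughput,
# no matter how many storage nodes you add. All numbers are assumptions.

raft_round_ms = 5          # leader -> quorum fsync -> ack (assumed)
entries_per_round = 100    # batching helps, but is bounded (assumed)

rounds_per_sec = 1000 / raft_round_ms
max_metadata_ops = rounds_per_sec * entries_per_round
print(f"ceiling: {max_metadata_ops:.0f} metadata commits/sec")
# With one metadata commit per write-to-disk, that ceiling caps the
# write IOPS of the entire cluster.
```

Adding storage nodes raises capacity and read throughput, but the metadata ceiling stays fixed at whatever one Raft group can sustain.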

Look, there are all kinds of reasons one could legitimately decide that none
of the existing scalable block storage systems satisfy your use case. Maybe
containers really are different enough from VMs. But the blog post claims that
there just _aren't_ any solutions; the research papers cited in the
Documentation page are mostly old and are about systems in a very different
part of this sub-space; and what developer documentation exists does not
encourage me that this is a good idea.

Granted, I've been working on Ceph for 7 years and am a bit of a snob as a
result.

~~~
zzzcpan
They accept data loss, yes, and haven't figured out consistency yet. But given
that it's an early prototype I would expect them to evolve and change things a
lot in a couple of years, possibly throwing away raft and etcd.

Good work on Ceph, by the way. I've been following your work since it was a
PhD project, if I remember correctly.

~~~
gregsfortytwo
"haven't figured out consistency yet" is the issue. Seemingly very small
decisions have huge impacts that you don't expect, and the state of the art is
far enough along that you don't get better than the existing solutions except
by exploiting newly-identified workload characteristics (or acceptable ways of
losing data, loosening consistency, etc based on the workload) that you've
planned out to make use of ahead of time.

One example: Both Gluster and Ceph have erasure-coded storage. Gluster's looks
just like the replicated storage, only it involves more nodes and less
overhead. Ceph's is severely limited in comparison: it's append-only, and it
doesn't allow use of Ceph's omap kv store, "object class" embedded code, etc.
The reason is that distributed EC is subject to the same kind of problem as
the RAID5 write hole: if a Gluster client submits an overwrite to 3 of a 4+2
replica group and then crashes, the overwritten data is unrecoverable and the
newly-written data never made it.
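The write hole is easy to demonstrate with plain XOR parity (a RAID5-style 4+1 group; simpler than the 4+2 example above, but the failure has the same shape). Nothing below is Gluster or Ceph code; it's a toy illustration.

```python
# Toy "write hole": a crash between updating a data shard and its
# parity leaves the stripe inconsistent, so a later shard loss
# reconstructs garbage.

def xor(*blocks):
    out = blocks[0]
    for b in blocks[1:]:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

a, b, c, d = b"AAAA", b"BBBB", b"CCCC", b"DDDD"
parity = xor(a, b, c, d)

# client overwrites shard `a` in place, then crashes before the
# parity update lands
a = b"XXXX"

# later, shard `b` is lost; we reconstruct it from the stale parity
recovered_b = xor(a, c, d, parity)
assert recovered_b != b"BBBB"   # the stripe is silently corrupt
```

Log-structured (append-only) designs sidestep this particular hazard because they never overwrite a live stripe in place, which is exactly the tradeoff discussed next.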

Torus won't hit that particular issue because it is log-structured to begin
with, which has all kinds of advantages. But garbage collection is _really
hard_! Much harder than seems remotely reasonable! Getting good coalescing and
read performance is _really hard_! Much harder than seems remotely reasonable!
There's one big existing log-structured distributed storage system
which has discussed this publicly: Microsoft Azure. They have a few papers out
which hint at the contortions they went through to make block devices work
performantly — and Azure writes first to 3-replica and then destages to the
log! They still had performance issues!

[https://github.com/coreos/torus/blob/master/Documentation/re...](https://github.com/coreos/torus/blob/master/Documentation/research-
links.md) points to a bunch of HDFS research and replacements; HDFS is
designed for the opposite (large files with high bandwidth and nobody-cares
latency) of what I presume Torus is targeting (high IO efficiency, low
latency). Mostly the same for the Google papers they cite. There's no mention
of Azure's storage system papers, nor of Ceph, nor anything about the not-
paper-publishing-but-blogging stuff from Gluster or sheepdog; nor from
academic research into VM storage systems (there's _tons_ about this!).

Can they fix a bunch of this? Sure. But the desires they list in eg
[https://github.com/coreos/torus/blob/master/Documentation/ar...](https://github.com/coreos/torus/blob/master/Documentation/architecture.md)
go towards making things _worse_, not better. They aren't talking about how
etcd can be in the allocator path but not the persistence path[1] and how
every mount should run a repair on the data to deal with out-of-date headers.
They talk about adding in filesystems, but not any way of supporting read-
after-write (which is impossible with the primitives they describe so far, and
_really hard_ in a log-structured system without synchronous communication of
some kind). They discuss network partitions between the storage nodes, and
between the client and etcd; they don't discuss clients keeping access to the
disks but losing it to etcd.

[1] Using etcd for allocation would be a reasonable choice, but putting it in
the persistence path is not. Right now a database in your container would
require two separate write streams to do an fsync:

1) data write. It doesn't say in the docs, and I didn't look to see whether
replication is client- or server-driven, but assuming sanity the network
traffic is client->server1->server2->server1->client, with a write-to-disk
happening before the server2->server1 step.

2) etcd write. Client->etcd master->etcd slaves (2+)->etcd master->client,
with a disk write to each etcd process' disk before the reply to etcd master.

This is a busy-neighbor, long-tail latency _disaster_ waiting to happen.
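To make the tail-latency point concrete: the fsync completes only when _both_ write streams finish, so a rare stall on any hop of either path dominates the high percentiles. The latency distribution below is invented purely for illustration.

```python
# Sketch: serial dependence on two multi-hop write streams amplifies
# tail latency. The hop distribution is made up for illustration.
import random

random.seed(1)

def hop_latency():
    # toy distribution: usually 1 ms, occasionally a 50 ms stall
    # (a busy neighbor, a GC pause, a slow disk flush)
    return 50.0 if random.random() < 0.01 else 1.0

def path_latency(hops):
    return sum(hop_latency() for _ in range(hops))

# an fsync waits for the slower of the data path and the etcd path
samples = [max(path_latency(4), path_latency(4)) for _ in range(10000)]
samples.sort()
p99 = samples[int(0.99 * len(samples))]
print(f"p99 fsync latency: {p99:.0f} ms")
```

With eight sequential hops at a 1% stall rate each, roughly 8% of fsyncs hit at least one stall, so the p99 sits at stall latency even though the typical fsync is a few milliseconds.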

~~~
betawaffle
These are very good points, but also probably much more constructive as GitHub
issues, where they can be either answered or addressed. In the meantime,
hopefully I can talk to a few of these:

>haven't figured out consistency yet

I don't recall this being the case, but I'm not the authority on the matter.

>I presume Torus is targeting (high IO efficiency, low latency)

The best-possible-performing storage solution is _definitely_ not our primary
goal, though we'll take it where we can get it. The most important goals for
the project are ease of use, ease of management, flexibility, and correctness
when the use-case desires it. Please note that the block device interface is
only one of many planned. The underlying abstraction was designed (and will be
improved) to support other situations.

>Using etcd for allocation would be a reasonable choice, but putting it in the
persistence path is not. Right now a database in your container would require
two separate write streams to do an fsync

In Torus today (with the block device interface specifically), and with the
caveat that I'm not the authority, so I may be slightly wrong: calling sync(),
fsync(), and friends results in what I think you would consider an
"allocation". Writes happen against a snapshot of the file (in this block
storage case, the block volume is the "file"), and then a sync() makes those
changes visible as the "current" version. Syncs hit etcd; writes do not.
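To make that concrete, here's a minimal sketch of the model as I understand it (again with the caveat that I may be slightly wrong about the real implementation): writes accumulate invisibly, and sync() publishes them with a single compare-and-swap on the metadata store, so only syncs pay the consensus round trip.

```python
# Sketch of a snapshot-then-publish write model. Illustrative only;
# names and structure are invented, not Torus's actual code.

class MetadataStore:
    """Stand-in for etcd: holds the 'current version' pointer."""
    def __init__(self):
        self.current = 0
    def compare_and_swap(self, expect, new):
        if self.current != expect:
            return False
        self.current = new
        return True

class Volume:
    def __init__(self, meta):
        self.meta = meta
        self.pending = {}          # block -> data, not yet visible
        self.committed = {}
    def write(self, block, data):
        # data path only: no metadata-store traffic per write
        self.pending[block] = data
    def sync(self):
        # one CAS publishes the whole pending snapshot
        base = self.meta.current
        if self.meta.compare_and_swap(base, base + 1):
            self.committed.update(self.pending)
            self.pending.clear()
            return True
        return False

vol = Volume(MetadataStore())
vol.write(0, b"hello")
assert 0 not in vol.committed   # invisible until sync
assert vol.sync()
assert vol.committed[0] == b"hello"
```

Under that model the etcd round trip is paid once per sync() rather than once per write, which is what distinguishes an allocation path from a persistence path.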

I would really encourage you to submit feedback like this in GitHub issues.
The project is still in _very_ early stages, and legitimate feedback can
actually make a difference.

------
agentultra
Where's the formal specification of the system? If we want these distributed
systems to be reliable how can we be sure our algorithms and processes work if
we do not first model them?

I'd be happy to work on a specification with the community.

~~~
philips
We haven't gotten that far down the path. We would love the help. I believe
there are some issues related to documenting the architecture and failure
domains for the v0.1.1 release:
[https://github.com/coreos/torus/milestones/v0.1.1](https://github.com/coreos/torus/milestones/v0.1.1)

This is one of those classic "release too early or release too late" sorts of
things; it got cut in preference to getting early feedback and community
participation.

~~~
subway
Coming from the company that raised hell about Docker's lack of a spec, this
is an interesting approach.

------
gsmethells
Is the goal to provide POSIX file access as NFS and AFS do? The mention of
object storage as a future direction feels like the early "the sky's the
limit" euphoria of a new project that has yet to adopt specific end goals.

~~~
philips
The project doesn't provide a POSIX filesystem directly. Instead, it provides
a block device (think AWS EBS).

The initial goals that are in this release:

1. An abstraction for replicated storage between machines in a cluster.

2. A block device "application" on top that an ext4 filesystem could be run
on. This is like an EBS.

In the future someone might build other applications like "object storage" or
filesystems, and we would love to get feedback on the API to do that. But
that isn't in the initial goals, and we will be focusing on the storage and
block layers for the time being.

~~~
gsmethells
Will each disk in the system be weighted so that different-size disks can be
used in the system over time? OpenStack Swift offers this, and it is useful
for not locking in disk size at design time.

~~~
philips
Yes, see the admin documentation:
[https://github.com/coreos/torus/blob/master/Documentation/ad...](https://github.com/coreos/torus/blob/master/Documentation/admin-
guide.md#modify-my-cluster)

------
CSDude
Seems really nice. Would love to see how it compares to Ceph; we were
considering Ceph for our container storage. For starters, it seems much
simpler to deploy.

~~~
Inflatablewoman
In my old firm we managed to get ceph running on coreos using ceph in docker.
The storage was exposed using a s3 gateway.

[https://github.com/ceph/ceph-docker](https://github.com/ceph/ceph-docker)

~~~
ymse
Docker has supported native RBD since version 1.8, which should be both faster
and more reliable than using the S3 gateway (though that has other benefits
and is becoming pretty good too).

------
nzoschke
I'm always in awe and grateful for the teams like CoreOS that tackle these
tough problems.

But when it comes to container storage I'm confused.

The 12 factor app methodologies work very well for every service I have
written and supported.

Processes/containers get ephemeral storage. When you need persistence you
delegate to a stateful service, generally Postgres and S3.

I'd much rather write my apps around these simple and well-understood
constraints than depend on magic file systems.

~~~
ecnahc515
The average 12 factor app doesn't really need this, but as you mentioned,
sometimes you delegate to Postgres, or S3, and today those are available as
hosted products via Amazon, Google, Heroku, etc.

However, when you're running in your own datacenter, you don't have the luxury
of using Amazon's hosted products, so Torus exists to help provide some
building blocks to build your own S3, or potentially even RDS.

------
ceocoder
Congrats on the release! How does this compare to GlusterFS (loved it, it was
free -- I used it before they got acquired by RedHat) and Isilon (expensive
but fast and reliable)? Is there an upper limit on the number of nodes or
total size?

------
koolba
So is this the building blocks for an in-house EBS?

~~~
philips
Yes, that is the idea. Torus block is like EBS. The Torus "library" could be
used by other applications like an append-only log, object store, etc.

~~~
stevelandiss
How do you say that with conviction? I looked at the code, and it has none of
the makings of a fault-tolerant elastic block storage system.

------
hartror
No mention of Jepsen tests. I look forward to an @aphyr "Call Me Maybe"
talk/blog post detailing all the issues.

------
teajunky
This reminds me of the SheepDog storage architecture
([https://sheepdog.github.io/sheepdog/](https://sheepdog.github.io/sheepdog/))
SheepDog provides safe distributed block devices and uses corosync or
Zookeeper instead of etcd.

------
js4all
Despite all the negative sentiment here, I am super excited about this. I use
CoreOS heavily and really like how everything just works. Running Kubernetes
on it is the first cluster solution for me that works without configuration
orgies and is robust against machine outages. Torus seems to be the missing
piece. For now we use local volumes with sidecar containers for r/o storage
and NFS volumes for r/w storage.

All other solutions are not practical. GCE and EBS are single-mount only.
iSCSI is unsupported in the cloud. That leaves only Ceph and GlusterFS, both
mentioned here, but both need heavy configuration.

~~~
lisivka
Gluster does not need heavy configuration: just add peers, then create and
start a volume -- 3 commands. Not harder than LVM. Of course, Gluster has tons
of options to fine-tune a cluster for various kinds of loads, which is not
bad, but confusing for newbies. But if you are a newbie, why would you need to
change the defaults?

~~~
js4all
That's interesting. When I was looking at it, it seemed overwhelming. Do you
have some pointers for the basics?

------
mrmondo
Watch out, next we'll be seeing CoreOS writing their own encryption protocols.
As someone that's designed and built storage clusters from the ground up -
take it from me: storage is not as easy as it seems if you care about your
data and performance at scale. The number of people I've seen using CoreOS and
then moving away from it is quite alarming, I feel like CoreOS will become the
Ubuntu of the container world.

~~~
jsmthrowaway
They're all-in on Kubernetes, which is quite apparent given (a) who's funding
them and (b) product direction, including this one. They don't have the
resources or backing to go after Docker; notice that changed? Docker has
something like 15x the funding and is pulling away in a lot of ways, so CoreOS
hitched on the Kubernetes wagon.

Better bet for destiny: Google is letting them build stuff off the books by
just funding them, and at some point they'll get quietly bought with all their
stuff added to Kubernetes, and that'll be that.

------
nwmcsween
So a userspace fs. I'm guessing this will use FUSE to actually expose a POSIX
fs? Small sync writes will absolutely kill performance and will require all
sorts of hacks, like GlusterFS has had to implement due to the amount of
context switches. CoreOS really should have added whatever was needed to Ceph;
filesystems are not something you just hack together overnight.

~~~
betawaffle
We had a POSIX interface early on (via FUSE), but decided to expose a block
storage interface first instead. This is not a "filesystem"; it's a storage
abstraction, and we've spent more than 6 months on it. Seems like a lot of
folks are quickly jumping to conclusions, which is expected from "the
internet", but I would have hoped for better from HN.

~~~
nwmcsween
6 months is overnight in terms of a filesystem; there are literally millions
of dollars of work that went into GlusterFS, and an order of magnitude more in
man-hours. I understand CoreOS wants to innovate, but what does Torus bring to
the table besides negatives and NIH?

------
frozenice
Funny, just this week I was looking for an easy solution to cluster storage
for containers.

I found SXCluster and posted it here
[https://news.ycombinator.com/item?id=11812566](https://news.ycombinator.com/item?id=11812566)
- not directly geared towards containers, but very easy to set up (no need for
etcd and the like).

------
amq
I just hope the performance is visibly better than Gluster's, especially for
small files.

~~~
zsmith928
are there any performance metrics for Torus yet?

------
JulieONeilEMC
You could also check out EMC's Unity midrange storage at the EMC Store.
[http://bit.ly/1SIZ9N7](http://bit.ly/1SIZ9N7)

------
anonbanker
How is this different than Tahoe-LAFS?

------
kingofhawks
Is it possible that Facebook haystack will be open source?

------
jsmthrowaway
I love that I know who this is at CoreOS by the choice of username. Pick
better throwaways if you're going to blow up a thread like this. (ideal0227
upthread is also a CoreOS employee and one of the etcd developers.)

On topic, yes. It is quite trivial to lose data with etcd and pretty much
everybody I know who runs it has experienced problems. Try backing it up and
restoring. etcd is compsci great yet operationally burdensome; it is
_extremely_ difficult to operate, tends to make assumptions and explain to you
how to operate your fleet, and the developers are not very receptive to these
and other operational concerns. This stems from CoreOS culture, to be quite
clear -- CoreOS seeks to eliminate operations as a discipline and replace it
with software, and tends to disregard or devalue operational concerns as a
result.

Launching down the storage path with Go software and disregarding several
decades of operational experience in the industry on how to do distributed
storage correctly (not to mention the "looking down" on etcd-related
infrastructure decisions from this anonymous employee) should indicate to you
that I'm not making this up, despite how it may appear.

90% of folks I know on etcd, I'd estimate, have (a) reverted to Zookeeper or
(b) moved on to Consul. It is the single piece of software that is holding
Kubernetes back, too, and it's fairly obvious from roadmap direction that
Kubernetes is now _the_ etcd customer. Plan accordingly.

~~~
alexandre_m
Cockroachdb is built on top of etcd. So far seems to work very well for their
product.

I think this whole thread is a bit harsh and unfair to CoreOs and their
solutions.

~~~
vidarh
I think that reflects the complexity of a storage product more than CoreOS
itself. I use CoreOS. I like it. I love the update mechanism, and the tight
focus on running as much as possible in containers.

But I also have spent enough time using it to run across certain warts, and
etcd clusters refusing to start without manual intervention have been a
frequent enough problem that when they go after something incredibly hard such
as distributed storage, relying on etcd, I see that as a scary combination.

------
zxcvcxz
It's threads like these that make you wonder just how many marketing
teams are arguing with each other in the comments. I take the FUD with a grain
of salt: there is a lot of financial incentive to create FUD, and where there
is financial incentive and little to no regulation, a market will naturally
arise. It's going to get worse as the bubble pops and companies become more
and more desperate.

~~~
notacoward
I don't think it's marketing teams. There are at least a half dozen CoreOS
employees/contributors commenting here, plus a couple of known allies. On the
"other side" there's Bryan, me, and one person who seems to be an ex-employee.
AFAIK _none_ of them are in marketing, or even in coordination with marketing.
Certainly, I often get flak from my company's official mouthpieces for saying
things that conflict with their preferred talking points. They _wish_ they
could control what I say.

Mostly I think this is a matter of people naturally standing up for their
friends and colleagues, which is a wonderful thing, vs. people who have
specific concerns about The Right Way to do either technical or non-technical
things. No coordination or collusion is necessary. You can see exactly the
same thing happen for _every_ company or project that's discussed here, or on
Twitter, or wherever. Try criticizing a YC portfolio company some time. Not
all of the people arguing with you will admit their affiliations, and of
course the downvotes are all anonymous anyway. Just remember, the more skin
someone has in the game, the more tempted they'll be to cross that vague line
into astroturf. All the rest of us can do is admit our affiliations and
biases, and hope that people will get their heads out of the _ad hominem_
gutter enough to reach rational conclusions about the facts being presented.

