

Magical Block Store: Why EBS Can't Work - lindvall
http://joyeur.com/2011/04/24/magical-block-store-when-abstractions-fail-us/

======
blantonl
I am an active user of EBS on a highly trafficked Web property, and came from
a long and tedious background in enterprise software.

I really think that one paragraph in his blog post summed everything up quite
nicely. It could not ring more true:

 _My opinion is that the only reason the big enterprise storage vendors have
gotten away with network block storage for the last decade is that they can
afford to over-engineer the hell out of them and have the luxury of running
enterprise workloads, which is a code phrase for “consolidated idle
workloads.” When the going gets tough in enterprise storage systems, you do
capacity planning and make sure your hot apps are on dedicated spindles,
controllers, and network ports._

------
edw
This awesome entry perfectly captures why I have always hated NFS. I can deal
with the possibility that if a machine's hard drive dies, my system is going
to have a very hard time continuing to operate in a normal manner, but then
NFS comes along, and you realize that all sorts of I/O operations that
previously employed a piece of equipment that failed once every two and a half
years now depend on a working network with a working NFS server on that
network, and the combination of that network and that server are orders of
magnitude less reliable.
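
edw's point can be made concrete with a back-of-envelope calculation (the availability figures below are made up purely for illustration): when an operation needs every component in a chain to be up at once, the availabilities multiply, so adding a network and an NFS server to the path drags the whole thing down.

```python
def serial_availability(*components):
    """Availability of an operation that needs every component up at once."""
    total = 1.0
    for availability in components:
        total *= availability
    return total

# Made-up numbers for illustration only: local disk 99.99%,
# network 99.5%, NFS server 99.5%.
local_disk = serial_availability(0.9999)
nfs_read = serial_availability(0.9999, 0.995, 0.995)
print(round(local_disk, 4), round(nfs_read, 4))  # 0.9999 0.9899
```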

And now you have situations on a regular basis where you type "ls" and your
shell hangs and not even "kill -9" is going to save you. And you go back to
using FTP or some other abstraction that does not apply 40,000 hour MTBF
thinking to equipment that disappears for coffee breaks daily.

~~~
ssmoot
I don't know how NFS keeps coming up. It's an entirely different use case. It
doesn't help the credibility of a critique of networked block storage to harp
on a vendor-specific implementation of a technology that doesn't even operate
in the same sphere.

An NFS server is very simple. With NFS on its own VLAN, and some very basic
QoS, there's no reason an NFS server should be the weak point in your
infrastructure. Especially since it's resilient to disconnection on a flaky
network.

If you're looking for 100% availability, sure, NFS is probably not the answer.
If on the other hand you're running a website, and would rather trade a few
bad requests for high-availability and portability, then NFS can be a great
fit.

None of that has anything to do with EBS or block-storage though.

Joyent's position is that iSCSI was flaky for them because of unpredictable
loads on under-performing equipment. The situation would degrade to the point
that they could only attach a couple VM hosts to a pair of servers for
example, and they were slicing the LUNs on the host, losing the flexibility
networked block-storage provides for portability between systems.

Here's what we do:

We export an 80GB LUN for every running application from two SAN systems.

These systems are home-grown, based on Nexenta Core Platform v3. We don't use
de-dupe since the DDT kills performance (and if Joyent was using it, then is
local storage without it really a fair comparison?). We provide SSDs for the
ZIL and L2ARC.

These LUNs are then _mirrored on the Dom0_. That part is key. Most storage
vendors want to create a black-box, bullet-proof "appliance". That's garbage.
If it worked maybe it wouldn't be a problem, but in practice these things are
never bullet-proof, and a failover in the cluster can easily mean no
availability for the initiators for some short period of time. If you're
working with Solaris 10, this can easily cause a connection timeout. Once that
happens you must reboot the whole machine even if it's just one offline LUN.

It's a nightmare. Don't use Solaris 10.

snv_134 will reconnect eventually. Much smoother experience. So you zpool
mirror your LUNs. Now you can take each SAN box offline for routine
maintenance without issue. If one of them out-right fails, even with dozens of
exported LUNs you're looking at a minute or two while the Dom0 compensates for
the event and stops blocking IO.
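
A rough sketch of the mirrored-LUN setup described above (hypothetical pool and device names, not the actual commands used; assumes one LUN from each SAN box is already attached over iSCSI):

```shell
# Mirror the two iSCSI LUNs in a zpool on the Dom0, so losing either
# SAN box degrades the mirror instead of taking the pool offline.
# Device names here are made up for illustration.
zpool create app1 mirror \
    c2t600144F04A1B3C00d0 c3t600144F04A1B3D00d0

# Either SAN box can now be taken down for maintenance; when its LUN
# returns, ZFS resilvers the stale side of the mirror automatically.
zpool status app1
```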

These systems are very fast. Much faster than local storage is likely to be
without throwing serious dollars at it.

These systems are very reliable. Since they can be snapshotted independently,
and the underlying file-systems are themselves very reliable, the risk of
data-loss is so small as to be a non-issue.

They can easily be replicated to tertiary storage, or incrementally backed up
offline.

To take the system out would require a network melt-down.

To compensate for that you spread link-aggregated connections across stacked
switches. If a switch goes down, you're still operational. If a link goes
down, you're still operational. The SAN interfaces are on their own VLAN, and
the physical interfaces are dedicated to the Dom0. The DomU's are mapped to
their own shared NIC.

The Dom0, or either of its NICs, is still a single point of failure. So you
make sure to have two of them. Applications mount HA-NFS shares for shared
media. You don't depend on stupid gimmicks like live-migration. You just run
multiple app instances and load-balance between them.

You quadruple your (thinly provisioned) storage requirements this way, but
_this_ is how you build a bullet-proof system using networked storage (both
block (iSCSI) and filesystem (NFS)) for serving web-applications.

If you pin yourself to local storage you take on massive replication costs and
commit yourself to very weak recovery options. Locality of your data kills you
when there's a problem. You're trading effective capacity planning for panic
fixes when things don't go so smoothly.

This is why it takes _forever_ to provision anything at Rackspace Cloud, and
when things go wrong, you're basically screwed.

Because instead of proper planning, they'd rather not have to concern
themselves with availability of your systems/data.

It's not a walk in the park, but if you can afford to invest in your own
infrastructure and skills, you can achieve results that are better in every
way.

Sure, you might not be able to load a dozen high-traffic Dom0's onto these SAN
systems, but that matters mostly if you're trying to squeeze margins as a
hosting provider. Their problems are not ours...

~~~
chubot
The point of the article is that you are taking an ancient interface and using
it for something new. Millions of lines of code were written against that
interface with old assumptions, and now you've moved it to a new
implementation without changing any of it. Things are bound to go wrong.

When you move sqlite to NFS, for example, file locking probably won't work.
There is nothing to tell you this.
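
The locking sqlite relies on is ordinary POSIX advisory record locking, which is easy to exercise directly (a minimal sketch; on a local filesystem the lock attempt succeeds, while over NFS it goes through the lock daemon and can fail or, worse, appear to succeed without actually excluding other clients):

```python
import fcntl
import os
import tempfile

def try_exclusive_lock(path):
    """Attempt the kind of fcntl lock sqlite takes before writing."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)  # closing the descriptor also drops the lock

path = os.path.join(tempfile.mkdtemp(), "test.db")
print(try_exclusive_lock(path))  # True on a local filesystem
```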

It sounds like you have experience making NFS work well, but I don't see how
anything you wrote addresses this point. In fact I think you're just echoing
some of the article's points about "enterprise planning". AFAICT you come from
the enterprise world and are advocating overprovisioning, which is fine, but
not the same context.

~~~
ssmoot
I work at a small shop that was badly burned by Sun/Oracle. :-)

It's not that I believe in overprovisioning, I think. It's that if data is
really that critical, and its availability is critical, then that has to be
taken into account during planning.

Everything fails at some point. The Enterprise Storage Vendors would have you
believe their stuff doesn't. In practice it's pretty scary when the black box
doesn't work as advertised anymore though _after_ you've made it the
centerpiece of your operations.

So with those lessons learned, our replacement efforts took into account the
level of availability we wanted to achieve.

I did go off on an NFS tangent. Sorry. But this article was about block-
storage, which is a different beast from what you describe.

Seeing all networked storage lumped together is like seeing: "FastCGI isn't
100% reliable, which is why I hate two-phase commits."

------
prodigal_erik
He didn't touch on Joyent's 2+ day partial outage a couple months ago:
<http://news.ycombinator.com/item?id=2269329>

~~~
jamie
Don't forget about this:
[http://www.datacenterknowledge.com/archives/2008/01/15/joyen...](http://www.datacenterknowledge.com/archives/2008/01/15/joyent-backup-services-down-for-three-days/)

~~~
jamie
I think both of these links illustrate that errors happen, mistakes happen,
software has bugs, and Murphy's law always strikes. The question is, when it
strikes, do you have enough control to fix the problem? If you've outsourced
the solution, does the provider have enough control/knowledge to fix the
problem?

These things will get much worse before they get better, and it's best to
think of all these abstractions as being a double-edged sword.

------
SoftwareMaven
Many things in software are impossible magic, until they are not. His argument
boils down to "it is a hard problem that nobody has solved yet." That doesn't
mean nobody will ever solve it.

Regardless, I do agree that building your application today like it is a
solved problem is the wrong way to do it.

~~~
blantonl
_Regardless, I do agree that building your application today like it is a
solved problem is the wrong way to do it._

That presumes that the application is being used as the right tool to resolve
the problem. And it also assumes that "the problem" is a finite and solvable
item.

~~~
sigil
> And it also assumes that "the problem" is a finite and solvable item.

Yes. To make this a bit more concrete, if "the problem" is making distributed
storage look and behave exactly like local storage, the CAP Theorem has
something to say about its solvability.

~~~
jamesaguilar
Depends. Local storage is also not perfectly available. If the network is
reliable, you can probably get availability high enough that the system feels
"close enough" to how local storage feels. Today's networks aren't that
reliable, but someday there may be enough redundancy and bandwidth for this to
happen.

~~~
sigil
> Local storage is also not perfectly available.

Technically true, although you don't have to contend with the consistency or
partitioning factors in the local disk case -- there's only one copy of the
state. This means you can focus on making the availability factor as close to
1.0 as possible.

This may not be the case when you're forced to balance all three CAP factors.
I sometimes wonder if a follow-on result to CAP will be a "practical"
(physical or information-theoretic) limit like C x A x P <= 1-h for some
constant h, and we'll just have to come to terms with that as computer
scientists, as physics had to with dx x dp >= h. This is of course wildly
unsubstantiated pessimism.
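
Written out, the speculation is (restating the commenter's own guess, with $h$ a purely hypothetical constant, not an established result):

```latex
\text{CAP theorem:}\quad (C,\, A,\, P) \neq (1,\, 1,\, 1)
\qquad
\text{speculated bound:}\quad C \cdot A \cdot P \;\le\; 1 - h, \quad h > 0
```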

Also, I would gladly entertain any argument demolishing the "local disks are
not subject to CAP" claim I made above by talking about read / write caches as
separate copies of the local disk state.

~~~
jamesaguilar
I doubt it. Suppose that there is a network that is never partitioned, and
machines connected to that network that never fail. In that case consistency
and availability should be perfect. Although networks will never be perfectly
reliable, nor machines, they seem to be getting more reliable. Perhaps someday
we may be able to say that the odds of enough partitions or machine failures
to make the system unavailable are lower than the odds of you getting struck
by lightning, at which point you will have for practical purposes defeated the
constraints of the CAP theorem.

~~~
sigil
> Suppose that there is a network that is never partitioned, and machines
> connected to that network that never fail. In that case consistency and
> availability should be perfect.

You mean, in that case tolerance to partition and availability should be
perfect.

> Perhaps someday we may be able to say that the odds of enough partitions or
> machine failures to make the system unavailable are lower than the odds of
> you getting struck by lightning, at which point you will have for practical
> purposes defeated the constraints of the CAP theorem.

So this is the really interesting question. All the CAP theorem says is that
(C,A,P) != (1.0,1.0,1.0). How close to (1.0,1.0,1.0) could we make (C,A,P)? If
infinitely close, then we have achieved perfection by the limit, and the CAP
theorem is rather pointless. If not, then what _is_ the numeric limit?

As you speculate, maybe the numeric limit on C x A x P is so close to 1.0 that
the odds of seeing a consistency, availability, or partitioning problem are
much smaller than getting hit by lightning.

Then again, maybe not. Who knows? ;)

To avoid sounding like a total crackpot, here is an interesting paper that
explores the physical limits of computation:

<http://arxiv.org/pdf/quant-ph/9908043v3>

~~~
jamesaguilar
> You mean, in that case tolerance to partition and availability should be
> perfect.

No. If a network is never partitioned, you don't need to write algorithms that
can tolerate partitions. Therefore consistency and availability are possible.

> So this is the really interesting question. All the CAP theorem says is that
> (C,A,P) != (1.0,1.0,1.0). How close to (1.0,1.0,1.0) could we make (C,A,P)?
> If infinitely close, then we have achieved perfection by the limit, and the
> CAP theorem is rather pointless. If not, then what is the numeric limit?

I think you have misunderstood the theorem (at least, if my bachelor-degree-
level understanding is correct). C, A, and P are not variables you can
multiply together or perform mathematical operations on. They are more like
booleans. "Is the web service consistent (are requests made against it
atomically successful or unsuccessful)?" "Is the web service available (will
all requests to it terminate)?" "Is the web service partition-tolerant (will
the other properties still hold if some nodes in the system cannot communicate
with others)?" These questions cannot be "0.5 yes". They are either all-the-
way-yes or all-the-way-no.

> . . . and the CAP theorem is rather pointless

Not really. It is pointful for networks that experience partitions. It just
doesn't apply to reliable networks. It also sort-of doesn't apply when an
unreliable network is acting reliably, with the caveat that since it is not
possible to tell in advance when a network will stop behaving reliably, you
still have to choose between these three properties when writing your
algorithms for when the network behaves badly.

~~~
sigil
> C, A, and P are not variables you can multiply together or perform
> mathematical operations on. They are more like booleans.

Right, but I wasn't restating CAP, just wondering about a follow-on to CAP
that considers the _probability of remaining consistent_ , the _probability of
remaining available_ , and the _probability of no failures due to network
partitions_ in physical terms.

Is this not an interesting thing to consider? What if someone proves a hard
limit on the product of these probabilities in some physical computation
context? The CAP theorem is absolutely fascinating to me, especially if it has
something real to say about the systems we can build in the future. The future
looks even more distributed.

> It is pointful for networks that experience partitions. It just doesn't
> apply to reliable networks.

Is there such a thing as a "reliable" network when thousands or millions of
computational nodes are involved? Are the routers and switches which connect
such a network 100% available? If an amplification attack saturates some
network segment with noise, what then?

As programmers, we desperately want things to work, and it's easy to greet
something like CAP with flat out denial. I know I'm always fighting it. "It
will never fail." No, it _can and will fail_.

~~~
jamesaguilar
I still don't understand what you mean when you say, "probability of remaining
consistent," etc. Either you wrote the service so the system would always
remain consistent or you didn't. Similarly with availability. Either the
system will always return a result, or it may sometimes hang.

Maybe what you mean is the probability of whichever of C, A, or P you gave up
actually becoming a problem? But I cannot imagine a physical law of the form
you are referring to applying uniformly to these disparate properties. I
wouldn't even know how to formulate it for consistency. For availability and
partition tolerance it would just be, "Requests to this service will
(availability: hang forever/partition-tolerance: return with errors) at a rate
exactly equal to the probability of network failures."

With regards to your last point, there are no reliable networks, at least
where I work. That doesn't mean there won't be.

------
johnb
It's funny how disk abstractions get you every time.

We used to store and process all of our uploads from our rails app on a GFS
partition. GFS behaved like a normal disk _most_ of the time, but we started
having trouble processing concurrent uploads and couldn't replicate in dev.

It turned out that, so GFS could work at all, it had different locking
semantics from regular disks. Every time you created a new file it had to lock
the containing folder. We solved it by splitting our upload folder into 1000
sequential buckets and writing each upload to the next folder along... but it
took us a long time to stop assuming it was a regular disk.
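
The workaround amounts to spreading file creation across many directories so concurrent creates rarely contend for the same directory lock (a rough Python sketch of the idea; the original was a Rails app, and only the 1000-bucket scheme comes from the comment, the names here are made up):

```python
import os

N_BUCKETS = 1000

def upload_path(upload_id, root="uploads"):
    """Spread uploads across N_BUCKETS directories so concurrent
    creates land in different folders, avoiding one hot directory lock."""
    bucket = upload_id % N_BUCKETS
    return os.path.join(root, f"{bucket:03d}", str(upload_id))

def store_upload(upload_id, data, root="uploads"):
    path = upload_path(upload_id, root)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path

print(upload_path(1001))  # uploads/001/1001
```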

~~~
sciurus
FWIW, this behavior is explained early on in the documentation for GFS2.

~~~
johnb
As we were using EngineYard for hosting at the time, everything was set up for
us and we never thought to look it up.

We now pay a lot more attention to the underlying stack. Even if you've
outsourced hosting (either cloud or managed physical servers), you really need
to know every component yourself.

------
spullara
Also worth noting is that Amazon isn't forcing you to use EBS. They also have
tons of fast local storage available to RAID as you wish.

~~~
lindvall
I strongly believe one of the most positive aspects of EC2 was that it
demonstrated a beautiful philosophy, that a node and its disks should not be
relied upon to always be around, and pushed it into the mainstream.

Even for people who didn't use EC2 the existence of the platform caused more
people to rethink their architectures to try to rely less on Important Nodes.

EBS is a step back from that philosophy and it's a point worth noting.

One of the great things this post does is enumerate some of the underlying
reasons why relying on EBS will inevitably lead to more failures, and in ways
that are harder and harder to diagnose.

~~~
leoc
> EBS is a step back from that philosophy and it's a point worth noting.

Amazon doesn't use EBS itself, right? Isn't EBS something AWS let its
customers nag it into, against (what it considers) its better judgement?

~~~
lindvall
Yep. And this may be one of those cases where they would have been better off
ignoring their customers' requests, for the good of their reputation and their
customers' uptime.

------
cagenut
It's really fascinating to watch Amazon re-learn/re-implement the lessons IBM
baked into mainframes decades ago. Once you get out of shared-nothing/web-
scripting land you realize that I/O is much more important and difficult than
CPU. What Amazon calls EBS, IBM has been calling "DASD" forever. I wonder if
there are any crossover lessons that they haven't taken advantage of because
there just aren't any old IBM'ers working at Amazon.

~~~
blantonl
IBM's implementation of DASD on the mainframe was always implemented under the
assumption that it was a _secondary_ storage medium for data. Meaning, it
wasn't accessed often, and it wasn't implemented for top performance.

Think of a bridge between high performance disk and tape.

------
spudlyo
_Trying to use a tool like iostat against a shared, network provided block
device to figure out what level of service your database is getting from
the filesystem below it is an exercise in frustration that will get you
nowhere._

This may be true under Solaris. Since 2.5, Linux has had /proc/diskstats and
an iostat that shows the average I/O request latency (await) for a disk,
networked or otherwise. For EBS it's 40ms or less on a good day. On a bad day
it's 500ms or more, if your I/O requests get serviced at all.
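
The await figure falls straight out of two samples of a device's counters (a sketch; it assumes you've pulled the "I/Os completed" and "milliseconds spent" counters for reads and writes from /proc/diskstats, which is essentially what iostat does):

```python
def avg_await_ms(prev, curr):
    """Average I/O latency (iostat's 'await') between two samples of
    (reads_completed, ms_reading, writes_completed, ms_writing)
    taken from /proc/diskstats for one device."""
    ios = (curr[0] - prev[0]) + (curr[2] - prev[2])
    ms = (curr[1] - prev[1]) + (curr[3] - prev[3])
    return ms / ios if ios else 0.0

# 20 I/Os taking 800ms total between samples -> 40ms average,
# the "good day on EBS" figure from the comment.
print(avg_await_ms((0, 0, 0, 0), (10, 400, 10, 400)))  # 40.0
```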

------
alecco
Amazon Six Sigma "Blackbelts", meet Mr. Black Swan.

Edit: my point is that you can't account for unexpected/unknown events in
statistical models; we should know better, coming from CS.

------
lobster_johnson
> It’s commonly believed that EBS is built on DRBD with a dose of S3-derived
> replication logic.

Actually, it was discovered some time ago
([http://openfoo.org/blog/amazon_ec2_underlying_architecture.h...](http://openfoo.org/blog/amazon_ec2_underlying_architecture.html))
that EBS probably used Red Hat's open-source GNBD:
<http://sourceware.org/cluster/gnbd/>

------
CPlatypus
He only gets it half right. A filesystem interface instead of a block
interface is the right choice IMO. Private storage instead of distributed
storage is the wrong choice for capacity, performance, and (most importantly)
availability reasons. They didn't go with a ZFS-based solution because it was
the best fit to requirements. They went with it because they had ZFS experts
and advocates on staff.

As Schopenhauer said, every man mistakes the limits of his own vision for the
limits of the world, and these are people who've failed to Get It when it
comes to distributed storage ever since they tried and failed to make ZFS
distributed (leading to the enlistment of the Lustre crew who have also
largely failed at the same task). If they can't solve a problem they're
arrogant enough to believe nobody can, so they position DAS and SAN as the
only possible alternatives.

Disclaimers: I'm the project lead for CloudFS, which is IMO exactly the kind
of distributed storage people should be using for this sort of thing. I've
also had some fairly public disputes with Bryan "Jackass" Cantrill, formerly
of Sun and now of Joyent, about ZFS FUD.

~~~
ssmoot
ZFS is just the FS. But you know that already.

The SAN solutions they migrated to are not ZFS-based. Unless I'm
misremembering (I read this a couple days ago), they were only using ZFS to
slice LUNs.

Point is, you're taking pot-shots at ZFS when the main thrust appears to be:
"It was hard to make iSCSI reliable. Once we did, by buying expensive storage-
vendor backed solutions, we found it wasn't financially compelling."

They're a hosting provider. If it takes a replicated SAN pair (which is the
wrong way to go about it BTW, though admittedly it's the way the storage
vendors and their "appliance" mentality would have it done) to service just a
pair of VM hosts (they're still using Zones right?) then it just didn't make
sense money-wise for them. If they planned capacity to provide great
performance, they weren't making enough money on the services for what they
were selling them for.

That's not an "iSCSI is unreliable" problem. It's not a "networked storage is
broken" problem. It's not a "networked storage is slow" problem. It's not even
a "ZFS didn't work out" problem.

If you go out and spend major bucks on NetApp, not only are you going to have
to deal with all the black-box-appliance BS, but it's going to cost a lot of
money. A LOT. And DAS is going to end up cheaper to deploy and maintain, and
your margins are going to be a lot higher.

DAS is the right choice for a hosting provider who wants to maximize their
profits in a competitive space.

It's not the best choice for performance, availability or flexibility for
clients though. So you have to ask yourself what kind of budget you have to
work with, and what goals are important to you?

BTW, there's _budget_, and then there's NetApp/EMC budget. Just because you
need/want more than DAS can give you doesn't mean you need to tie your boat to
an insane Enterprise grade budget.

~~~
CPlatypus
Perhaps you should RTFA. The author says explicitly that _what they do now_ is
"lean on ZFS" and "keep the network out of the storage solution" which made
their provisioning more complex because they could no longer treat local disks
as ephemeral (i.e. that data can't be assumed to exist anywhere else). I knew
this when I wrote the GP. My whole point is that _they treated it_ as a
"networked storage is broken" problem even though it wasn't, because of their
"ZFS is the only tech we need" bias. Thanks for re-stating that.

As for "DAS is the right choice", that's just wrong on many levels. First,
people who know storage apply "DAS" to both private (e.g. SATA/SAS) and shared
(e.g. FC/iSCSI) storage, so please stop misusing the term to make a distinction
between the two. Second, I don't actually recommend either. I don't recommend
paying enterprise margins for anything, and I don't recommend more than a
modicum of private storage for cloud applications where most data ultimately
needs to be shared. What I do recommend is distributed storage based on
commodity hardware and open-source software. There are plenty of options to
choose from, some with all of the scalability and redundancy you could get
from their enterprise cousins. Just because some people had some bad
experience with iSCSI or DRBD doesn't mean all cost-effective distributed
storage solutions are bad and one must submit to the false choice of
enterprise NAS vs. (either flavor of) DAS.

In short, open your eyes and read what people wrote instead of assuming this
is the NAS vs. DAS fight you're used to.

~~~
ssmoot
They "lean on ZFS" for DAS.

Seriously. You tell me. What does that have to do with your rant on ZFS? It
could have as well been an LSI controller doing RAID6. Or mdadm. Doesn't
matter.

That's the evolved solution they came up with.

The "networked storage is broken" pitch actually comes in with the EMC/NetApp
interim solution as well. I don't buy it either, but it's a joke to claim the
problem was ZFS on the Zones when the Targets weren't running ZFS.

You're awfully prickly, but I didn't suggest it came down to "Enterprise" NAS
vs DAS. I actually think networked storage is here to stay (and that's a good
thing).

I have my doubts we'll see a stable, inexpensive (or free) Distributed or
Clustered file-system ready to replace traditional solutions anytime soon. I'm
happy to see people try though.

You clearly have an axe to grind with ZFS though. In my experience it's been
by far more stable than any available Linux FS I've used. Pull the power again
and again, replace and resilver all you want. Manage terabytes and don't worry
about corruption. I wouldn't trust ext3/4fs for anything I couldn't stand to
lose...

PS: <http://en.wikipedia.org/wiki/Direct-attached_storage>

"People who know storage". I don't see iSCSI on that list. Nor FCoE. DAS (at
least according to Wikipedia) explicitly rules out switching. Which is how
I've always viewed it.

~~~
CPlatypus
You're really not getting it, are you? I never said ZFS was the problem, as
you seem to think. I'm just saying it's not the solution either. It's a crappy
solution, failing to protect against host failures and creating myriad
problems in provisioning around the fact that each VM's storage is stranded on
one node until it's explicitly copied somewhere else. And if you don't think
there are decent distributed filesystems out there, you're just not keeping up
with the field and shouldn't be commenting on it.

~~~
ssmoot
I don't think I am getting it no. You _don't_ think ZFS is the problem?

So you _aren't_ calling ZFS a "crappy solution"? Just the DAS usage?

What is your gripe exactly then? The overblown critique of networked storage?
Well we agree on that at least then. I think.

Honestly, with all the "read the fucking article", it's-not-DAS, oh-it-is,
CloudFS is way moar better than ZFS, I never said ZFS sucked, "Bryan ZFS
Cantrill is a jackass", you've left me absolutely bewildered as to what your
intended point (if any) actually is.

For the record, my only comment on (free) distributed filesystems (that aren't
vendor-locked and actually unusable to me) is that I wouldn't personally trust
them with my data. Not until they have the features I need, and then are
running out in the wild, widely deployed for a couple years so I'm not a
guinea pig.

I'll even throw you a bone: even just last year ZFS was having major
meltdowns when a new, inadequately vetted feature was added. A few years ago
it wasn't uncommon to face corruption when trying to do fairly routine things
managing disks. Bugs can and do happen.

Maybe CloudFS, or Gluster is ready for prime-time, housing terabytes of data
reliably and never making a misstep. I just don't think it's smart to bet your
business on it. Not at least without a plan B since moving data around isn't
an option when you're down and have terabytes you need to get back online.

