
Announcing Linode Block Storage Volumes - ljoshua
https://blog.linode.com/2018/02/01/announcing-linode-block-storage-volumes/
======
erikrothoff
Speaking as a Linode user with all my eggs in their basket: I'm sad that they
are falling so far behind other players like DigitalOcean. I'm looking more
and more at DO these days just because of the speed at which they're able to
deliver new features.

That being said, this is a great addition! Looking forward to trying it out
when it reaches my datacenter. I'm also looking forward to seeing what their
next big project will be.

~~~
chatmasta
Why does block storage seem to take companies so long to implement?
DigitalOcean only implemented it in the past year, when they already had
hundreds of employees. It seems like it would be a priority feature, so I
imagine a sizable team was working on it. Why did it take so long for
DigitalOcean, and now Linode, to implement block storage? Are there inherently
architecture-dependent complexities that make it a deceptively difficult
project to implement?

~~~
nickvanw
I was the Engineering Manager for the Storage team at DigitalOcean that took
the Block Storage project from conception to launch (though I no longer work
there), so I might be able to shed some light.

In general, it's really hard to do at-scale network-backed storage - by the
time your applications get access to the file system, there are a myriad of
abstractions that aren't always receptive to the idea of the network "going
away", or even a modicum of lag. On top of that, in order for it to be
profitable, you need to work at a massively-shared scale. This means expensive
SSDs, servers, and switches that require a lot of capex, with no guaranteed
revenue because it's a new product. For us, this meant building entirely new
network architecture in some places so that the massive amount of data moving
within the storage cluster and between it and the VMs wouldn't overwhelm
existing traffic.

In order to deliver the network reliability and persistence that a normal
application expects, you need extremely strong consistency and low latency.
Every redundancy strategy (replication and erasure coding alike) requires each
write to touch more than one SSD/HDD/NVMe device before the write can be
acknowledged, and all of that needs to happen in a shared system with an
immense amount of contention, every time.
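
To make the "each write has to touch more than one device before it can be
acknowledged" point concrete, here's a minimal, purely illustrative Python
sketch of a quorum-acknowledged replicated write - not DigitalOcean's or
Linode's actual code; the replica count, quorum size, and latencies are made
up:

    import random
    import time

    REPLICAS = 3          # hypothetical replication factor
    WRITE_QUORUM = 2      # acknowledge once this many replicas have the data

    class Replica:
        def __init__(self):
            self.blocks = {}

        def write(self, lba, data):
            # Simulate variable device/network latency per replica.
            time.sleep(random.uniform(0.0005, 0.005))
            self.blocks[lba] = data
            return True

    def replicated_write(replicas, lba, data):
        acks = 0
        for r in replicas:  # a real system would issue these in parallel
            if r.write(lba, data):
                acks += 1
            if acks >= WRITE_QUORUM:
                return True  # safe to acknowledge: data is on more than one device
        return False         # quorum not reached: the write must be reported as failed

    cluster = [Replica() for _ in range(REPLICAS)]
    assert replicated_write(cluster, lba=42, data=b"\x00" * 4096)

Every acknowledged write pays the latency of the slowest replica in its
quorum, which is why tail latency and network blips matter so much in a shared
cluster.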

It takes a while because you only get one opportunity to get all of this right
- it's one thing if the network has a few more blips in a month, or if there's
a bit more CPU contention than you'd like, but you absolutely can't lose
people's data.

I can understand why companies are so hesitant to do this - there may be
technical debt in their software/network stack that makes it very difficult,
or they may not want to proceed unless they have the right set of experts
working on the project.

~~~
neom
How do you like Ceph generally as a technology? Implementation details aside,
what are your thoughts on it?

~~~
antongribok
Not the OP, but I've been running production Ceph clusters for the past 4.5
years at two different Fortune 50 companies.

We've had very good success with Ceph for block storage and a fairly rough
time with it for object storage. We're currently doing our best to improve the
latter, both through our own upstream contributions and by collaborating with
Red Hat.

From a technology standpoint, I think it is very interesting and for the most
part has a lot of very good engineering. However, it is fairly complex and
even today it's very easy to have a hard time with it when starting out. You
really need to pay attention to every detail and your hardware selection is
extremely important.

It is extremely resilient and goes to great lengths to preserve your data.
Ceph can be performant, but that requires very good hardware and a very good
network.

My experience is limited up to the Jewel release (we haven't upgraded to
Luminous and we are not planning on using BlueStore anytime soon).

~~~
rodgerd
> However, it is fairly complex and even today it's very easy to have a hard
> time with it when starting out.

Sage's talk at LCA covered the work they're doing here:
[https://www.youtube.com/watch?v=GrStE7XSKFE](https://www.youtube.com/watch?v=GrStE7XSKFE)

But yes, at small scale Gluster is still a lot easier to deploy and run.

------
infogulch
I wonder how difficult it would be to build an S3-like interface on top of
this so you can get coarse pay-for-what-you-use pricing and avoid downtime,
but also have a much larger capacity than the 10TB maximum for a single
volume.

You might be able to build this on top of Minio. Start with, say, four 1GB
Linodes (the smallest), each with 8 volumes (the maximum) at the smallest
volume size of 10GB, and a somewhat low 1:3 parity in Minio (redundancy is
mostly handled by Linode's replication). That would be 320GB * $0.10 + 4 * $5
= $52/mo to start with. After some utilization threshold, incrementally resize
all volumes to grow dynamically; the parity drives would fill in while a
volume is offline and resizing. The parity is also enough to resize the
Linodes one at a time if you need to increase their compute capacity. This
system could grow up to 320TB raw / 240TB accessible.
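
As a back-of-the-envelope check of those numbers (assuming the $0.10/GB/mo
volume price and $5/mo Linode price used above, and reading "1:3 parity" as
one parity drive per three data drives):

    NODES = 4                 # smallest 1GB Linodes
    VOLUMES_PER_NODE = 8      # per-Linode volume limit mentioned above
    VOLUME_GB = 10            # smallest volume size
    PARITY_FRACTION = 1 / 4   # "1:3 parity": 1 parity drive per 3 data drives

    raw_gb = NODES * VOLUMES_PER_NODE * VOLUME_GB        # 320 GB raw
    monthly = raw_gb * 0.10 + NODES * 5                  # $32 + $20 = $52/mo
    usable_gb = raw_gb * (1 - PARITY_FRACTION)           # 240 GB usable

    # Scaling every volume up to the 10TB maximum:
    max_raw_tb = NODES * VOLUMES_PER_NODE * 10           # 320 TB raw
    max_usable_tb = max_raw_tb * (1 - PARITY_FRACTION)   # 240 TB usable
    max_monthly = max_raw_tb * 1000 * 0.10 + NODES * 5   # roughly $32k/mo

    print(raw_gb, monthly, usable_gb, max_raw_tb, max_usable_tb, max_monthly)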

When I last poked around at this idea, back when Linode block storage was
introduced, this "should" work with Minio, but I got the impression they
didn't really consider this kind of use case.

~~~
jerf
Depends on how many 9s you want in your reliability. The first couple aren't
too bad; it gets harder after that.

~~~
infogulch
If you run it at a higher level of abstraction than Linode volumes, with
parity, and can tolerate up to 8 drives / 1 server going down at once (as this
setup on Minio should get you), that would take you a long way towards adding
one or two 9's.
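
As a rough illustration of that intuition, here's an independent-failure model
with made-up numbers; real failures are correlated (shared hosts, switches,
power), so this overstates the benefit, but it shows why parity across many
volumes buys extra 9's:

    from math import comb

    def p_unavailable(n_drives, tolerated, p):
        # Probability that more than `tolerated` of `n_drives` are down at
        # once, assuming each is independently unavailable with probability p.
        return sum(comb(n_drives, k) * p**k * (1 - p)**(n_drives - k)
                   for k in range(tolerated + 1, n_drives + 1))

    single = 0.001  # assume a single volume is unavailable 0.1% of the time
    cluster = p_unavailable(32, 8, single)  # 32 volumes, tolerate 8 failures

    print(f"single volume unavailable:        {single:.3e}")
    print(f"cluster unavailable (>8 at once): {cluster:.3e}")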

High availability is just one S3 feature, though. Other important features are
paying for what you use and effectively unlimited storage growth (you would
probably revisit this design before you hit the ~$32k/mo the full 320TB would
cost). Even if this didn't add reliability, those other features still have
utility beyond the raw Block Storage Volumes Linode is providing here.

------
hemancuso
I wish these block storage services gave you some idea of failure
rate/durability and availability. Amazon publishes some rough volume-loss
rates, but not even Google tells you what kind of durability to expect out of
a persistent volume. They all say they are tri-replicated, which semi-implies
highly durable storage. What about availability?

Lastly, I'd love to know whether DO/Linode have rolled their own solution or
are using Ceph or something similar. Not that I don't trust them, but they
aren't recruiting the same engineers as Google.

~~~
wmf
Since they just started offering it, Linode probably doesn't have accurate
statistics to share, and most people can't correctly interpret very small
probabilities anyway. They'd probably be better off saying something like "you
should assume that each volume will fail at some point in its life".

~~~
hemancuso
They have been offering it since June, FWIW. And it is worth knowing the order
of magnitude of expected failure rates compared to just running against the
local SSD.

------
antongribok
I know that DigitalOcean uses Ceph under the hood. Does anyone know what
Linode is using?

~~~
tkulick
We are leveraging Ceph for our Block Storage solution.
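
For context on what that typically looks like under the hood: Ceph exposes
block devices through RBD, where each volume is an RBD image in a RADOS pool.
A purely illustrative sketch using the python-rbd bindings - not Linode's
actual tooling; the pool name, image name, and size here are made up:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('rbd')  # hypothetical pool name
        try:
            # Create a 10 GiB image; this is what a customer "volume" maps to.
            rbd.RBD().create(ioctx, 'customer-volume-1', 10 * 1024**3)
            # The image would then be attached to a VM (e.g. via librbd/QEMU
            # or the in-kernel rbd client) and show up as a block device.
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()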

