

Announcing Beta For Local SSD - cleverjake
http://googlecloudplatform.blogspot.com/2014/10/announcing-beta-for-local-ssd.html

======
benchess
"No planned downtime: Local SSD data will not be lost when Google does
datacenter maintenance."

Only unplanned downtime then :)

Live migration maybe adds another '9' to reliability, but customers still need
to plan for failure.

~~~
jsolson
(note: I work on GCE networking)

Yes, certainly, it's always possible that hardware will fail with no notice,
and any architecture needs to account for that (this is equally true if you're
running your own infrastructure, of course). For some applications downtime in
these situations may be the right value tradeoff, for others they'll need to
architect things assuming always-on redundancy. In terms of data durability,
though, local SSD gets you a performance win by being local. If there's a
catastrophic failure of that machine, data is gone.

What live migration buys you, the prospective cloud customer, is transparent
avoidance of (a) Google's planned hardware maintenance windows (for example,
network or power infrastructure maintenance) and (b) outages due to hardware
failures which can be detected and migrated away from before they cause data
loss.

The second category includes any number of situations which, if you can't
migrate your workload, require taking a machine out of service, replacing some
gear, bringing it back up, and hoping the replacement fixes the issue[0]. If
we can instead migrate your VM away from the issue (say, for example, it's a
bad root hard disk -- we need to replace the drive as lots of things in the VM
hosting environment depend on it, but your workload is otherwise completely
unaffected), we are able to service the affected machine with zero downtime
for your service.

[0]: Think about the number of things that could manifest as a flaky network
connection. Could be any (or all) of: bad cable, bad NIC, bad motherboard, bad
CPU, or even bad RAM (since NICs use bus-mastering DMA from host memory in
most cases). After migrating the workload away, we can take as long as
necessary to reliably diagnose and fix the machine. Huge win for us and for
our customers.
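A minimal sketch of the "assume local data can vanish" pattern described above: write hot data to the fast local SSD path, then copy it to durable storage. Both directory paths and the function name are hypothetical, for illustration only.

```python
import shutil
from pathlib import Path

def write_with_replica(name: str, data: bytes,
                       local_dir: Path, durable_dir: Path) -> Path:
    """Write hot data to the fast local path, then copy it to a durable
    location, since a catastrophic host failure loses the local copy."""
    local_path = local_dir / name
    local_path.write_bytes(data)                   # fast path: local SSD
    shutil.copy2(local_path, durable_dir / name)   # durable replica
    return local_path
```

In a real system the replication would be asynchronous (or handled by the datastore itself); the point is only that the local copy is treated as a cache of something durable, not as the source of truth.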

------
pwarner
Given the provisioning flexibility, I wonder whether these are internal or
external. If internal, it seems like they eat a lot of cost when they're never
provisioned. If external, how do they keep them as fast as local?

~~~
wmf
They're using NVMe, so the SSDs are either really local or at least in the
same rack using a PCIe switch. One could imagine that different servers have
different numbers of SSDs and when you create a VM with N SSDs they put it on
a server with that many.


~~~
runarb
That is quite interesting. I am currently looking into setting up a virtual
platform that will use SSD disks.

Do you have a source I can look at for more info on what SSD setup Google
Compute Engine uses?

~~~
wmf
TFM: [https://cloud.google.com/compute/docs/local-ssd](https://cloud.google.com/compute/docs/local-ssd)
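The docs linked above cover the details; as a rough sketch (instance name and zone are placeholders, flags and device path per the current gcloud CLI and GCE docs), attaching N local SSDs means repeating the flag N times:

```shell
# Create a VM with two NVMe local SSDs (name and zone are placeholders).
gcloud compute instances create example-instance \
    --zone us-central1-f \
    --local-ssd interface=nvme \
    --local-ssd interface=nvme

# On the VM: format and mount the first local SSD.
sudo mkfs.ext4 -F /dev/disk/by-id/google-local-ssd-0
sudo mkdir -p /mnt/disks/local-ssd-0
sudo mount /dev/disk/by-id/google-local-ssd-0 /mnt/disks/local-ssd-0
```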

------
contingencies
_Optimization considered harmful: In particular, optimization introduces
complexity, and as well as introducing tighter coupling between components and
layers._ \- RFC3439, via
[https://github.com/globalcitizen/taoup](https://github.com/globalcitizen/taoup)

~~~
contingencies
Pfft. Classic Hacker News... downvoted to Pluto without discussion, despite
the validity of an alternate perspective. One day you kids will learn :)

To spell it out, "Optimization considered harmful" means time may be misspent
caring about these things, and/or that optimizing may actually result in
changes to your application or service that take it backwards in measurable
ways.

"In particular, optimization introduces complexity, and as well as introducing
tighter coupling between components and layers." means, for example, that
relying on Google Cloud's SSD support to run your service locks you in and
makes you dependent on them, their ability to provide availability, their
internal decision on how long they choose to maintain that service offering,
and their internal decisions on pricing. It further suggests that this time
may be better spent getting to the heart of the problem, eg. removing the disk
I/O performance constraint from your application that made you want an SSD in
the first place by sharding your datastore.

This is a pretty valid line of thinking as an antidote to 'rah rah'.
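To make the sharding suggestion concrete, a minimal sketch of hash-based shard routing, which spreads I/O over several smaller datastores instead of one fast disk (the function name and key format are made up for illustration):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard deterministically via a stable hash.

    Using a cryptographic hash (rather than Python's built-in hash(),
    which is salted per process) keeps routing stable across restarts.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Spread 1000 hypothetical keys over 4 datastore shards:
assignments = [shard_for(f"user:{i}", 4) for i in range(1000)]
```

Each shard then only has to sustain a fraction of the total I/O load, which may remove the need for a faster disk in the first place.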

