
Persisting state between AWS EC2 spot instances - p8donald
https://peteris.rocks/blog/persisting-state-between-aws-ec2-spot-instances/
======
manigandham
Persistent storage remains a complicated problem. Attaching volumes on the fly
with the Docker volume abstraction works well enough for most cloud workloads,
whether on-demand or spot, but it's still easy to run into problems.
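
For example, the generic pattern looks like this (a sketch; the volume and image names are arbitrary):

```sh
# Create a named volume and hand it to a container; the volume driver
# (local, or a cloud plugin) decides where the bytes actually live.
docker volume create app-state
docker run -d -v app-state:/var/lib/app my-image
```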

This is driving rapid progress in clustered/distributed filesystems; OrangeFS
[1] even has a client built into the Linux kernel now. There are also
commercial companies like Avere [2] who make filers that run on object storage
with sophisticated caching to provide a fast, networked, but durable filesystem.

Kubernetes is also changing the game with container-native storage. This seems
to be the most promising model for the future, as K8S can take care of
orchestrating all the complexities of replicas and stateful containers while
storage is just another container-based service using whatever volumes are
available to the nodes underneath. Portworx [3] is the leading commercial option
today, with Rook and OpenEBS [4] catching up quickly.

1. [http://www.orangefs.org](http://www.orangefs.org)

2. [http://www.averesystems.com/products/products-overview](http://www.averesystems.com/products/products-overview)

3. [https://portworx.com](https://portworx.com)

4. [https://github.com/openebs/openebs](https://github.com/openebs/openebs)

~~~
objectivefs
Using a clustered/distributed filesystem definitely simplifies persisting
state between EC2 spot instances. It also makes it easier to scale out the
workload when you need more instances accessing the same data. To add to your
list: there is also ObjectiveFS [1], which integrates well with AWS (uses S3
for storage, works with IAM roles, etc.) and EC2 spot instances.

[1] [https://objectivefs.com](https://objectivefs.com)

~~~
manigandham
This looks very interesting, and good competition to Avere based on the info
so far. Is there any native Kubernetes integration in the works?

~~~
objectivefs
We are looking into the best way to add native kubernetes support. Currently,
you can add a mount on the host or directly mount the file system inside the
container. Both approaches work well, so it mainly depends on your preferred
architecture.

~~~
manigandham
A persistent volume provider would be great:
[https://kubernetes.io/docs/concepts/storage/persistent-volumes/](https://kubernetes.io/docs/concepts/storage/persistent-volumes/)

This makes it easy to declare the volume as part of the deployment and
automatically attach storage when the container is run. Mounting on the host
isn't very easy (or even possible sometimes), especially with spot/preemptible
instances and the increasing abstractions by managed K8S providers. The
pricing model might need to be different though if billing on a
container-mount level.
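
Declaring one is just a short manifest; a minimal sketch using the standard PersistentVolumeClaim API (the name and size here are made up):

```sh
# Declare a claim; the cluster's volume provisioner satisfies it with real
# storage (e.g. an EBS volume) and reattaches it wherever the pod lands.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: state-pvc            # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
EOF
```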

------
solatic
OP is offering some very dangerous advice.

Twenty years ago, software was hosted on fragile single-node servers with
fragile, physical hard disks. Programmers would read and write files directly
from and to the disk, and learn the hard way that this left their systems
susceptible to corruption in case things crashed in the middle of a write. So
behold! People began to use relational databases which offered ACID guarantees
and were designed from the ground up to solve that problem.

Now we have a resource (spot instances) whose unreliability is a _featured
design constraint_ and OP's advice is to just mount the block storage over the
network and everything will be fine?

Here's hoping OP is taking frequent snapshots of their volumes because it sure
sounds like data corruption is practically a statistical guarantee if you take
OP's advice without considering exactly how state is being saved on that EBS
volume.

~~~
colechristensen
Your response is fairly ridiculous.

A spot instance interruption isn't a system crash, it's a shutdown signal.
Storing your important spot instance data on EBS is recommended by AWS. If
your application can't handle a normal system shutdown without losing data,
your application is at fault, not your system setup.

>exactly how state is being saved on that EBS volume

Files are written to a filesystem, which is cleanly unmounted at shutdown when
an interruption happens.

~~~
derefr
And even if that wasn't true, network-attached storage (unlike local storage)
has no semantics for communicating a "partially completed" write of a block.
Your server either manages to send an iSCSI packet to the SAN with a completed
checksum, or it doesn't. Which means that—for the problems that _would_ arise
from a sudden power-cut to a VM (let's say from unexpected hypervisor
failure)—using a journalling filesystem on your network disks would perfectly
compensate for those problems.

~~~
solatic
Partially completed write of a block, sure. But partially completed write of a
file?

I can imagine (cough) an application where the application is trying to write
some binary blob to disk, doesn't finish before shutdown, and upon reboot,
tries to load the binary blob back into memory, fails because the binary blob
isn't consistent, doesn't handle the failure well, and refuses to boot.

App's fault? Sure. Does the customer care at 2 am? Nope.

~~~
colechristensen
Then all you're saying, over and over, is that in your imagination, not using
a long-running instance is very dangerous because rebooting exposes the
fragility of your app.

Honestly, it's much safer in that circumstance to have a frequently rebooting
instance because it will quickly expose your app's fragility during normal
operations instead of that fragility being exposed in a disaster.

~~~
solatic
> it's much safer in that circumstance to have a frequently rebooting instance

I actually happen to agree with you in principle on this, and it's at the root
of my current side project.

But sometimes you just don't have the flexibility to fix or replace the app.
Ops engineering, like any other kind of engineering, is about dealing with
real-world constraints and making the most of the resources you have. Most
apps, on some notion of a fragility spectrum, are far closer to fragile than
to antifragile, because fragile is the default, and extensive stress-testing
to understand and plan for all failure modes before a production deployment
isn't typically feasible. At that point, if you can't fix it, you have to work
around it.

~~~
colechristensen
All you're doing is advocating larger, less frequent failures handled by
people who know less. Robustness isn't just about your software or your ops
setup, but also about your people and their knowledge and experience. I cannot
see how less frequent, more intense failures with people who know less are
preferable, or how anything else is "very dangerous advice".

You will ultimately have many fewer resources available if your strategy is to
gloss over failure modes by telling inexperienced engineers to hope they won't
happen. It's technical debt and the interest payments are very high.

~~~
sitkack
You are both right. But both wrong. If you want better consistency, use either
object storage or a database. If you are mutating multiple entities and need
consistency, now you need a distributed transaction.

But _ALL_ cloud providers give warning before an instance is shut down. There
is absolutely no reason, other than a crash, for an instance to have a hard
shutdown.

~~~
colechristensen
He makes valid points, but in defense of an originally ridiculous statement
that the article's suggestions are extremely dangerous. There are all sorts of
benefits to an ACID database; it's just not reasonable to scream about the
necessity of one because reboots are scary.

~~~
sitkack
I agree.

But! Lots of applications aren't built to handle partial writes, which will
absolutely occur if apps are hard-killed. Any discussion around this topic
should reference Crash-only Software [0][1][2] and Microreboots [3].

[0] [https://en.wikipedia.org/wiki/Crash-only_software](https://en.wikipedia.org/wiki/Crash-only_software)

[1] [https://www.usenix.org/conference/hotos-ix/crash-only-software](https://www.usenix.org/conference/hotos-ix/crash-only-software)

[2] [https://lwn.net/Articles/191059/](https://lwn.net/Articles/191059/)

[3] [https://www.usenix.org/legacy/event/osdi04/tech/full_papers/candea/candea.pdf](https://www.usenix.org/legacy/event/osdi04/tech/full_papers/candea/candea.pdf)

------
bdcravens
Spot instances can now "stop" instead of "terminate" when you get priced out,
persisting the attached EBS volumes:

[https://aws.amazon.com/about-aws/whats-new/2017/09/amazon-ec2-spot-can-now-stop-and-start-your-spot-instances/](https://aws.amazon.com/about-aws/whats-new/2017/09/amazon-ec2-spot-can-now-stop-and-start-your-spot-instances/)
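
Concretely, this is the interruption behavior on the spot request; a rough sketch with the AWS CLI (the AMI, instance type, and key name are placeholders, and "stop" is only valid for persistent requests):

```sh
# Request a spot instance that is stopped (EBS volumes preserved)
# rather than terminated when the spot price rises above your bid.
aws ec2 request-spot-instances \
  --type persistent \
  --instance-interruption-behavior stop \
  --instance-count 1 \
  --launch-specification '{
    "ImageId": "ami-12345678",
    "InstanceType": "m4.large",
    "KeyName": "my-key"
  }'
```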

~~~
fredsted
This should really be at the top!

~~~
likelynew
EBS is not the root volume.

~~~
bdcravens
From the announcement:

"The EBS root device and attached EBS volumes are saved..."

Some instance types don't support an instance-store root and require an EBS
root anyway.

------
otterley
Even if you don't use spot instances, the technique of using separate EBS
volumes to hold state is useful (and well-known). Ordinary on-demand instances
can also be terminated prematurely due to hardware failure or other issues, so
storing state on a non-root volume should be considered a best current
practice for any instance type.

------
fulafel
There's a mechanism in Linux for exactly this purpose: pivot_root. It's used
in the standard boot process to switch from the initrd (initial ramdisk)
environment to the real system root.

ec2-spotter classic uses this, but you can also make a pivoting AMI of your
favourite Linux distribution.

One thing to watch out for is keeping the OS's automatic kernel updates
working. AMIs are rarely updated, and you're going to have a "damn vulnerable
linux" if you don't apply the updates just after booting a new image.

------
js4all
When you are using Kubernetes, you won't have to deal with this yourself. The
cluster will move pods off nodes that are stopped because the spot price
exceeded your bid. Ideally, place nodes at different bids; then there will be
a performance hit but no outage. With the new AWS stop/start feature [1],
nodes will come back up when the spot price sinks.

1) [https://aws.amazon.com/about-aws/whats-new/2017/09/amazon-ec2-spot-can-now-stop-and-start-your-spot-instances/](https://aws.amazon.com/about-aws/whats-new/2017/09/amazon-ec2-spot-can-now-stop-and-start-your-spot-instances/)

------
yjftsjthsd-h
TLDR: Attach an EBS volume and use it to store the Docker containers.

I suppose it's a decent solution if you don't want to deal with prefixes.
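
One way to realize that, as a sketch (the device, mount point, and use of Docker's `data-root` setting are my assumptions, not necessarily the article's exact steps):

```sh
# Format once, then mount the persistent EBS volume.
mkfs -t ext4 /dev/xvdf        # FIRST BOOT ONLY -- this erases the volume
mkdir -p /data
mount /dev/xvdf /data

# Point the Docker daemon's storage (images, containers, volumes) at it.
cat > /etc/docker/daemon.json <<'EOF'
{ "data-root": "/data/docker" }
EOF
systemctl restart docker
```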

------
Pirate-of-SV
To make this even more streamlined, you'd tag the volumes, discover them with
`aws ec2 describe-volumes`, and filter for unattached volumes with the magic
tag.
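
Something like this (the tag key/value, AZ, and device are placeholders):

```sh
# Find an unattached volume in this AZ that carries the magic tag...
VOLUME_ID=$(aws ec2 describe-volumes \
  --filters "Name=tag:role,Values=spot-state" \
            "Name=status,Values=available" \
            "Name=availability-zone,Values=us-east-1a" \
  --query 'Volumes[0].VolumeId' --output text)

# ...and attach it to the current instance.
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 attach-volume --volume-id "$VOLUME_ID" \
  --instance-id "$INSTANCE_ID" --device /dev/xvdf
```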

~~~
sevagh
There's a handful of tag-based automatic EBS volume attachers out there:

* [https://github.com/sevagh/goat](https://github.com/sevagh/goat) (my own)

* [https://github.com/UKHomeOffice/smilodon](https://github.com/UKHomeOffice/smilodon)

------
stonewhite
We normally utilize spot instances with Spotinst + Elastic Beanstalk. Our
billing has looked great ever since.

This solution looks good, yet it only applies to single-instance scenarios. I
presume this kind of thinking might move forward with EFS + chroot for an
actually scalable solution that can't be run on Elastic Beanstalk.

------
archgoon
So I was pleasantly surprised to discover that for the last several years,
spot instances have provided a mechanism that gives you 2 minutes' notice
prior to shutdown:

[http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html)

Learn something new every day. :)

[https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/](https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/)
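
The notice shows up in the instance metadata, so a minimal poller is enough to trigger a clean shutdown of your own (the service name and mount point below are hypothetical):

```sh
# The endpoint returns 404 until an interruption is scheduled, then a timestamp.
while true; do
  if curl -sf http://169.254.169.254/latest/meta-data/spot/termination-time >/dev/null; then
    systemctl stop myapp    # hypothetical app service
    sync
    umount /data            # the persistent EBS mount, if you have one
    break
  fi
  sleep 5
done
```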

~~~
bdcravens
See my top-level comment - you can now set the "shutdown" behavior to stop
instead of terminate (though the 2-minute notice is still useful).

------
sciurus
The author goes to great lengths to come up with a way for the software that
was running on a terminated spot instance to be relaunched using the same root
filesystem on a new spot instance, but they never explain _why_ they need to
do _exactly_ this. Maybe they already ran everything in Docker containers on
CoreOS, so their solution isn't a big shift, but I strongly suspect they could
find a simpler way to save and restore state if they got over this obsession
with preserving the root filesystem their software sees.

------
olegkikin
If you don't care about reliability, why not just get a cheap and powerful
VPS? Paying $90/month for that machine is madness. I pay $6/month for 6GB RAM,
4 cores, 50GB disk.

~~~
deivid
Where? I'm using Digital Ocean and it'd be way more expensive for that kind of
configuration.

~~~
kuschku
Here’s a list of providers by cost:

[https://git.io/vps](https://git.io/vps)

(PS: Don’t use DigitalOcean; they tend to steal your credit if they feel like
it. I lost 100 bucks of "promotional credit" that way, with only a few days'
notice.)

~~~
js4all
Same happened to me. I "lost" all my credit. It was not promotional, but
something I had paid for. They informed me on March 31st that I wouldn't be
able to use that credit after May 1st. :-(

P.S. They had no expiration policy in place when I added the credit.

Now I am happy with AWS.

------
ramanan
Well, one easy way when using Ubuntu-like distributions is to simply place
your `/home` folder on a separate (persistent) EBS volume [1].

With a few on-boot scripts to attach volumes and start containers, it should
be fairly easy to get going as well.

[1] [https://engineering.semantics3.com/the-instance-is-dead-long-live-the-instance-8b159f25f70a](https://engineering.semantics3.com/the-instance-is-dead-long-live-the-instance-8b159f25f70a)
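
The mount half of that is tiny (the device name is a placeholder; the attach step is the same describe-volumes/attach-volume dance shown elsewhere in this thread):

```sh
# Mount the persistent volume over /home and make it survive reboots;
# "nofail" keeps boot from hanging if the volume isn't attached yet.
mount /dev/xvdf /home
echo '/dev/xvdf /home ext4 defaults,nofail 0 2' >> /etc/fstab
```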

~~~
TrickyRick
This was exactly what I was thinking: why complicate things by replacing the
root volume when one can simply mount the disk at any other directory and
point the application there?

------
likelynew
I don't know why all the comments are saying this is a bad idea. For me, one
of the things I use EC2 for is deep learning. I just use a spot GPU instance,
attach an overlayroot volume, and launch a Jupyter notebook on it. Other
things like Google Dataflow are not useful to me due to the price and the
process of installing packages. I can also think of many other use cases for
using a persistent volume for some manual task.

------
amq
Wouldn't it be simpler to have the smallest possible instance run an NFS
server? This would also have the additional bonus of scalability.

Edit: or use AWS EFS
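
For the EFS route, clients just mount it as NFSv4 (the filesystem ID and region below are placeholders; the options are the ones AWS recommends):

```sh
# Mount an EFS filesystem over NFSv4.1; many instances can share it.
mkdir -p /mnt/efs
mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
  fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs
```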

~~~
otterley
EFS is far more expensive than EBS. Price it out; you'll see.

~~~
Johnny555
It is 3x more expensive ($0.30/GB vs $0.10/GB for us-east), but it's
replicated across AZs (so it is more durable than EBS, which is only
replicated within an AZ), and you only pay for what you use; you don't need to
overprovision the EBS volume to account for peak dataset size.

And since it's shared, you don't need to replicate data across multiple
nodes... so if 10 compute nodes need access to the data set, they can all just
read it from the same EFS filesystem, no need to download it 10 times, once to
each compute node.

So EFS can still be very cost effective compared to EBS.
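
To put illustrative numbers on it (my example, at those list prices): a 100 GB
dataset needed by 10 nodes costs 10 × 100 GB × $0.10/GB = $100/month as
per-node EBS copies, versus 100 GB × $0.30/GB = $30/month as one shared EFS
filesystem.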

~~~
otterley
Are you counting the impact on the ENI's available bandwidth and additional
instance costs needed for more network throughput? As I understand it, EFS
requests are issued through the front end interface, while EBS requests go
through the storage backplane interface.

Also, NFS has different behavior with respect to buffer caching that needs to
be taken into account. It often does not cache as effectively as block storage
does.

------
raverbashing
Is it just me, or should spot instances deal with work and not storage, so
that your (stateful) units of work live in a queue/DB (on a non-spot
instance)?

Attaching and detaching volumes is a good idea, but I wouldn't use it to keep
state.

------
tuananh
We use k8s at work. I just have to create a PVC, and when a spot instance is
terminated along with its containers, a new container will be created that
mounts the PVC again automatically.

------
jdchernofsky
Or you could just use Spotinst: [https://spotinst.com/](https://spotinst.com/)

------
alex_duf
It sounds wrong to try to keep state across two EC2 instances. If you find
yourself in that situation, try a bit harder to push your state outside the
EC2 instance (DynamoDB, S3, etc...).

You will get a _lot_ of benefit out of it, though you may lose some
performance, which is fine in 99% of the cases.

