Persistent storage remains a complicated problem. Attaching volumes on the fly with the Docker volume abstraction works well enough for most cloud workloads, whether on-demand or spot, but it's still easy to run into problems.
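For concreteness, a rough sketch of that pattern using the Docker SDK for Python (the image, volume name, and mount path are just examples, not anything from the article):

    import docker

    client = docker.from_env()

    # Create (or reuse) a named volume so the data outlives any single container.
    client.volumes.create(name="app-data")

    # Mount the volume into the container at the application's data directory.
    client.containers.run(
        "gitlab/gitlab-ce:latest",
        detach=True,
        volumes={"app-data": {"bind": "/var/opt/gitlab", "mode": "rw"}},
    )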
This is driving rapid progress in clustered/distributed filesystems, and support is even built into the Linux kernel now with OrangeFS [1]. There are also commercial companies like Avere [2] who make filers that run on object storage, using sophisticated caching to provide a fast, networked, yet durable filesystem.
Kubernetes is also changing the game with container-native storage. This seems to be the most promising model for the future: K8s takes care of orchestrating all the complexities of replicas and stateful containers, while storage is just another container-based service using whatever volumes are available to the nodes underneath. Portworx [3] is the leading commercial option today, with Rook and OpenEBS [4] catching up quickly.
I also want to highlight that AWS now allows spot instances to be stopped instead of terminated, so only the compute is taken away while your data persists automatically, as long as you use EBS root/attached volumes.
Using a clustered/distributed filesystem definitely simplifies persisting state between EC2 spot instances. It also makes it easier to scale out the workload when you need more instances accessing the same data. To add to your list: there is also ObjectiveFS [1], which integrates well with AWS (uses S3 for storage, works with IAM roles, etc.) and with EC2 spot instances.
We are looking into the best way to add native Kubernetes support. Currently, you can add a mount on the host or mount the filesystem directly inside the container. Both approaches work well, so it mainly depends on your preferred architecture.
Native support would make it easy to declare the volume as part of the deployment and attach storage automatically when the container runs. Mounting on the host isn't very easy (or even possible sometimes), especially with spot/preemptible instances and the increasing abstraction from managed K8s providers. The pricing model might need to be different, though, if billing happens at the container-mount level.
Twenty years ago, software was hosted on fragile single-node servers with fragile, physical hard disks. Programmers would read and write files directly from and to the disk, and learn the hard way that this left their systems susceptible to corruption in case things crashed in the middle of a write. So behold! People began to use relational databases which offered ACID guarantees and were designed from the ground up to solve that problem.
Now we have a resource (spot instances) whose unreliability is a featured design constraint and OP's advice is to just mount the block storage over the network and everything will be fine?
Here's hoping OP is taking frequent snapshots of their volumes because it sure sounds like data corruption is practically a statistical guarantee if you take OP's advice without considering exactly how state is being saved on that EBS volume.
A spot instance interruption isn't a system crash, it's a shutdown signal. Storing your important spot instance data on EBS is recommended by AWS. If your application can't handle a normal system shutdown without losing data, your application is at fault, not your system setup.
>exactly how state is being saved on that EBS volume
Files are written to a filesystem which is cleanly unmounted at shutdown when an interruption happens.
And even if that weren't true, network-attached storage (unlike local storage) has no semantics for communicating a "partially completed" write of a block. Your server either manages to send an iSCSI packet to the SAN with a complete checksum, or it doesn't. That means a journalling filesystem on your network disks would fully compensate for the problems that a sudden power cut to a VM (say, from an unexpected hypervisor failure) would cause.
Common filesystems only do metadata journaling, so your file contents are not protected by this. As an exception, ext3 and ext4 support a full data journaling mode via the data=journal mount option.
Even if you had data journaling, it won't give you consistency between different files. This post used GitLab as an example, and git will break if some of the files in its database are updated but others are not. Git doesn't use fsync to enforce their update order, and I don't know whether GitLab enables it or whether the performance hit is reasonable.
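To make the single-file half of this concrete, here's a rough sketch (Python; the helper function is hypothetical) of the write-then-fsync-then-rename pattern an application needs if it wants a file to be either fully old or fully new after a crash; note it still says nothing about ordering across multiple files:

    import os

    def atomic_write(path, data):
        # Write to a temp file, flush and fsync it, then atomically rename it
        # over the target, so readers see either the old or the new contents.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # force the new contents to stable storage
        os.rename(tmp, path)          # atomic replacement on POSIX filesystems
        # fsync the directory so the rename itself survives a crash
        dir_fd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)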
Partially completed write of a block, sure. But partially completed write of a file?
I can imagine (cough) an application that tries to write some binary blob to disk, doesn't finish before shutdown, and upon reboot tries to load the blob back into memory, fails because it isn't consistent, doesn't handle the failure well, and refuses to boot.
App's fault? Sure. Does the customer care at 2 am? Nope.
Then all you're saying over and over is that in your imagination, not using a long running instance is very dangerous because rebooting exposes the fragility of your app.
Honestly, it's much safer in that circumstance to have a frequently rebooting instance because it will quickly expose your app's fragility during normal operations instead of that fragility being exposed in a disaster.
> it's much safer in that circumstance to have a frequently rebooting instance
I actually happen to agree with you in principle on this, and it's at the root of my current side project.
But sometimes you just don't have the flexibility to fix or replace the app. Ops engineering, like any other kind of engineering, is about dealing with real-world constraints and making the most of the resources you have. Most apps, on some notion of a fragility spectrum, are far closer to fragile than to antifragile, because fragile is the default, and extensive stress-testing to understand and plan for all failure modes before a production deployment isn't typically feasible. At that point, if you can't fix it, you have to work around it.
All you're doing is advocating larger, less frequent failures handled by people who know less. Robustness isn't just about your software or your ops setup, but also about your people and their knowledge and experience. I cannot see how less frequent, more intense failures handled by people with less experience are preferable, let alone how anything else amounts to "very dangerous advice".
You will ultimately have many fewer resources available if your strategy is to gloss over failure modes by telling inexperienced engineers to hope they won't happen. It's technical debt and the interest payments are very high.
You are both right. But both wrong. If you want better consistency, use either object storage or a database. If you are mutating multiple entities and need consistency, now you need a distributed transaction.
But ALL cloud providers give warning before an instance is shut down. There is absolutely no reason, other than a crash, for an instance to go through a hard shutdown.
He makes valid points, but in defense of the original, ridiculous claim that the article's suggestions are extremely dangerous. There are all sorts of benefits to an ACID database; it's just not reasonable to scream about its necessity because reboots are scary.
But! Lots of applications aren't built to handle partial writes, which will absolutely occur if apps are hard-killed. Any discussion around this topic should reference Crash-only Software [0][1][2] and Micro Reboots [3].
> If your application can't handle a normal system shutdown without losing data, your application is at fault, not your system setup.
Unless something in the system shutdown fails to give the application what it needs (for instance, time) to shut down cleanly. Which is entirely possible, considering that Amazon sells you the spot instance on the explicit understanding that it can hand the hardware to somebody willing to pay more at any time. Nowhere in their documentation does Amazon guarantee the time needed for a clean shutdown of a spot instance (only that a two-minute warning will be available via their proprietary mechanism, if you architect your application to monitor for it), and you would be ill-advised not to architect for that.
> Storing your important spot instance data on EBS is recommended by AWS
Because EBS itself is reasonably reliable. If you have configuration data (e.g. in /etc) for a legacy application that isn't managed, it's reasonable to keep that data on EBS, since it's rarely written to and writes are generally human-initiated and human-monitored (with operations policy possibly mandating a snapshot before any changes are made).
That's still very different from daemon writes to /var. Take, for instance, the PostgreSQL documentation, which warns that snapshots must include the WAL in order to be recoverable, and that it is quite difficult to restore from a snapshot if you stored your WAL on a different mount: https://www.postgresql.org/docs/10/static/backup-file.html
You need to understand precisely how your application is treating your storage and act accordingly. Thinking that all applications interact with storage the same way is dangerous and liable to cause data corruption and loss. That's all.
Spot instances are shut down cleanly via the usual stop semantics (which run all the shutdown handlers, provided your OS supports them). Assuming your database software handles a clean shutdown via SIGTERM, everything should be fine.
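As a minimal sketch of what that means in practice (Python, with a sleep standing in for real work; this is not from the article), a service that exits cleanly on SIGTERM might look roughly like:

    import signal
    import sys
    import time

    shutdown_requested = False

    def handle_sigterm(signum, frame):
        # The main loop checks this flag, finishes its current unit of work,
        # and then shuts down cleanly.
        global shutdown_requested
        shutdown_requested = True

    signal.signal(signal.SIGTERM, handle_sigterm)

    while not shutdown_requested:
        time.sleep(1)  # stand-in for one unit of real work

    # flush buffers, fsync files, close database connections, etc., then exit
    sys.exit(0)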
You're assuming that people are saving their state in databases to begin with. If you're saving state to a database in production, typically you're communicating with that database over a network connection, and not running the database on the same machine as your application. Containerizing databases is a whole separate issue.
OP's specific example is saving /var/opt/gitlab to an EBS volume and expecting to be able to move it from one spot instance to another without corruption. That strikes me as insane.
What is so insane about this? It's no different than plugging in a USB drive, modifying some data on it, then disconnecting. Except in this case, the mount/unmount happens outside of the application's lifecycle so it can initialize and shutdown cleanly without worry.
And if GitLab (or whichever other application) is hanging and the stop script fails to cleanly shut down the application?
Shit happens at scale; that's precisely why ACID guarantees are important. Specifically in GitLab's case, because configuration is stored under /etc/gitlab, relying on EBS snapshots as a safeguard against corruption only works if the snapshot covers the entire FS, not just /var/opt/gitlab. If your machine is properly provisioned from an AMI, or at least from some kind of configuration management, and you have a reasonably-enforced policy that only permits changes through those management systems, then maybe you can get away with snapshotting only /var/opt/gitlab. But now we're getting into the territory of "I understand how my data is being stored on the EBS volume (in this case, according to documented GitLab instructions) and I am acting accordingly". Then, if the /var/opt/gitlab snapshot ends up being corrupted, the odds of finding an uncorrupted one increase with the number of snapshots you keep, and that is probably good enough in this specific instance, because if you needed a better guarantee than that, you'd have a proper HA setup.
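If you do lean on EBS snapshots as that safeguard, the call itself is trivial; a rough boto3 sketch, with a placeholder volume ID:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Volume ID is a placeholder; snapshot the data volume before risky changes.
    snap = ec2.create_snapshot(
        VolumeId="vol-0123456789abcdef0",
        Description="pre-change snapshot of the /var/opt/gitlab volume",
    )
    print(snap["SnapshotId"])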
This pattern is a lot safer if you use ZFS. Spot instances don't just disappear, though: you get a notification and have a chance to perform shutdown actions, except in the case of hardware failure, which is the same risk non-spot instances face.
- EBS, being block storage, has no knowledge of the filesystem format on top of it. It doesn't recognize that you formatted the volume as ZFS, and Amazon's native EBS snapshotting will not use ZFS snapshots. If you want ZFS snapshots, you have to build that on top of what Amazon gives you, along with every other aspect of ZFS storage, i.e. building a ZFS storage pool from separate EBS volumes. It would be nice if Amazon had a hosted ZFS solution, but so far that doesn't seem to exist.
- Yes, you get a notification, but it's a proprietary notification scheme that your application must be designed to poll for. Why can't Amazon use standard signals like SIGPWR to indicate imminent shutdown?
- Just because it isn't smart for non-spot instances doesn't suddenly make it smart for spot instances ;)
SIGPWR is anything but standard, and it's unclear how AWS would even send that signal to your processes without adding an agent to the instance.
Currently they initiate an ACPI shutdown event at the termination time. It's hard to initiate a shutdown in a more standardized manner. An instance shut down via this signal will generally see the init process begin gracefully stopping services, eventually halting on its own. Typically your init process will get increasingly aggressive with kill signals, as defined by your service definitions, eventually reaching SIGKILL. If your init process fails to get the vCPU halted, after an (undocumented?) period AWS will halt the CPU(s) for you. This is about as graceful a shutdown as you're going to get with 'standard' interfaces.
Termination Notifications go out of their way to give you an extra heads-up, in case your application is unlikely to handle being shut down by the init system gracefully. Think DB hosts with a crapload of dirty blocks that take a few minutes to sync to disk at shutdown.
Even if you don't use spot instances, the technique of using separate EBS volumes to hold state is useful (and well-known). Ordinary on-demand instances can also be terminated prematurely due to hardware failure or other issues, so storing state on a non-root volume should be considered a best current practice for any instance type.
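As a rough illustration of that practice with boto3 (the IDs and device name below are placeholders; in reality you'd discover them via tags or instance metadata):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    VOLUME_ID = "vol-0123456789abcdef0"    # placeholder data volume
    INSTANCE_ID = "i-0123456789abcdef0"    # placeholder instance

    # Attach the persistent data volume to the freshly launched instance.
    ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=INSTANCE_ID, Device="/dev/xvdf")

    # Wait until the attachment completes before mounting it inside the OS.
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[VOLUME_ID])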
There's a mechanism exactly for this purpose in Linux: pivot_root. It's used in the standard boot process to switch from the initrd (initial ramdisk) environment to the real system root.
ec2-spotter classic uses this, but you can also make a pivoting AMI of your favourite Linux distribution.
One thing to watch out for is keeping the OS's automatic kernel updates working. AMIs are rarely updated, and you're going to have a "damn vulnerable Linux" if you don't pull the updates right after booting a new image.
When you are using Kubernetes, you won't have to deal with this yourself. The cluster will move pods off nodes that are stopped because the spot price was exceeded. Ideally, place nodes at different bid levels; then there will be a performance hit but no outage. With the new AWS start/stop feature [1], nodes will come back up when the spot price sinks.
To make this even more streamlined, you'd tag the volumes, discover them with `aws ec2 describe-volumes`, and filter the unattached volumes by the magic tag.
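A rough boto3 sketch of that discovery step (the tag key and value are hypothetical):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Find volumes carrying the "magic" tag that are not attached to anything.
    resp = ec2.describe_volumes(
        Filters=[
            {"Name": "tag:role", "Values": ["gitlab-data"]},
            {"Name": "status", "Values": ["available"]},  # "available" == unattached
        ]
    )
    print([v["VolumeId"] for v in resp["Volumes"]])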
We normally use spot instances with Spotinst + Elastic Beanstalk. Our billing has looked great ever since.
This solution looks good, yet it only applies to single-instance scenarios. I presume this kind of thinking might move forward with EFS + chroot for an actually scalable solution that can't be run on Elastic Beanstalk.
So I was pleasantly surprised to discover that for the last several years, spot instances have provided a mechanism that gives you two minutes' notice prior to shutdown:
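A minimal sketch of watching for that notice from inside the instance (Python; it polls the spot termination-time instance metadata endpoint, which returns 404 until an interruption is scheduled):

    import time
    import requests

    URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

    while True:
        resp = requests.get(URL, timeout=1)
        if resp.status_code == 200:
            # A 200 response carries the scheduled termination timestamp.
            print("interruption scheduled for", resp.text)
            # stop services, sync and unmount the data volume, detach it, etc.
            break
        time.sleep(5)  # poll every few seconds, well within the two-minute window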
The author goes to great lengths to come up with a way for the software that was running on a terminated spot instance to be relaunched using the same root filesystem on a new spot instance, but they never explain why they need to do exactly this. Maybe they already ran everything in Docker containers on CoreOS, so their solution isn't a big shift, but I strongly suspect they could find a simpler way to save and restore state if they got over this obsession with preserving the root filesystem their software sees.
If you don't care about reliability, why not just get a cheap and powerful VPS? Paying $90/month for that machine is madness. I pay $6/month for 6GB RAM, 4 cores, 50GB disk.
I would look at providers like OVH, or even cheaper ones (Treudler, TransIP, RamNode, etc.).
For example, a VPS with 2 vCPUs, 8 GB RAM, and a 40 GB SSD is $13.49 per month from OVH.
(PS: Don’t use DigitalOcean, they tend to steal your credit if they feel like it. I lost 100 bucks of "promotional credit" that way with only a few days' notice.)
Same happened to me. I "lost" all my credit. It was not promotional, but something I had paid for. They informed me on March 31st that I wouldn't be able to use that credit after May 1st. :-(
P.S. They had no expiration policy in place when I added the credit.
For anyone curious: DO issued a ton of promo credit in the past, with an unlimited redemption period, then eventually last year said that credit would expire 12mo after redemption - effective after a month.
They backtracked on that regarding non-promo credit (referrals etc) and gave a 1-year grace period.
FWIW, I've been very happy with DO, had a couple $5 VPSes there for 3-4 years and they've been remarkably reliable. One host migration, one SLA credit and lengthy failure analysis, and a bunch of notifications ahead of time for maintenance. More than I'd expect for most hosts in the price range.
Not the most powerful for your money, of course, but awesome if you need to run some services with a public IP and consistent uptime.
Actually, they only emailed users to warn them about this ca. 10 days before it was revoked.
I had gotten $100 promotional credit from DO with the GitHub student pack, and planned to use it in my second year of university, as I knew we had to do a practical project there where I’d need it. Well, a few weeks before that project was about to start, I got the email from DO telling me they’d invalidate all my credit next week. In the end, I hosted that project with OVH, and spent over 80€ on it.
But that was extremely annoying, and while I originally wanted to also move servers of a few projects I was hosting to DO, after this I decided not to.
I said 1 month because the initial email I got was on 3/31/16, stating expiration effective 5/1, then another email retracting the expiration of my referral credits on 4/27.
A lot of people lost not only promotional credits on DO, which you get for free, but referral credits too, which you earn by sending traffic and which, you know, are actually worth something. So, yeah, trust is something DO doesn't deserve.
This is a question of trust. I have to trust that DO will keep my data safe, and that, if the US government were after my data, DO would prevent them from accessing it. I have to trust that DO won’t access my data.
How am I supposed to entrust my, and my users’, personally identifying data to a company that revokes credit just like that, without warning, and says "well, if you ask nicely, you can get it back"?
> ...if the US government were after my data, DO would prevent them from accessing it.
This is completely unrealistic. If the [local jurisdiction government] is after your data, they'll have your host, ISP, and anyone else give it to them.
(Inexplicable downtime = your server being imaged.)
Believing anything else, IMO, is purely delusional.
Why? You’d trust all your private data, and your customers’ data, to a company that just tried to scam you out of money if you hadn’t been careful? (Maybe "scam" is a strong word, but the result is the same; changing the ToS to revoke credit with only a week’s warning certainly is shady.)
This was exactly what I was thinking, why complicate things by replacing the root volume when one can simply mount the disk to any other directory and point the application there?
I don't know why all the comments are saying this is a bad idea. For me, one of the things I use EC2 for is deep learning. I just spin up a spot GPU instance, attach the overlayroot volume, and launch a Jupyter notebook on it. Other options like Google Dataflow aren't useful to me because of the price and the hassle of installing packages. I can also think of many other use cases for keeping a persistent volume around for manual tasks.
NFS is nice but a single instance can easily become network bound, especially on AWS. It also introduces a single point of failure for that instance, and clustered NFS can be fragile.
EFS is 3x more expensive ($0.30/GB vs $0.10/GB in us-east), but it's replicated across AZs (so it's more durable than EBS, which is only replicated within an AZ), and you only pay for what you use; you don't need to overprovision an EBS volume to account for peak dataset size.
And since it's shared, you don't need to replicate data across multiple nodes... so if 10 compute nodes need access to the data set, they can all just read it from the same EFS filesystem; no need to download it separately to each of the 10 compute nodes.
So EFS can still be very cost effective compared to EBS.
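To make the trade-off concrete with made-up numbers: if 10 compute nodes each kept a private 100 GB EBS copy of a dataset, that's 10 × 100 GB × $0.10 = $100/month, whereas one shared 100 GB EFS filesystem is 100 GB × $0.30 = $30/month (ignoring throughput limits and any instance-side networking costs).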
Are you counting the impact on the ENI's available bandwidth and additional instance costs needed for more network throughput? As I understand it, EFS requests are issued through the front end interface, while EBS requests go through the storage backplane interface.
Also, NFS has different behavior with respect to buffer caching that needs to be taken into account. It often does not cache as effectively as block storage does.
And while we are talking about costs, make sure you check for unused EBS volumes frequently, since you still pay for them even when they aren't attached or used. Sometimes a dev will create a provisioned-IOPS drive and forget to delete it, and you pay a lot for those volumes.
Is it just me, or should spot instances deal with work and not storage, so that your (stateful) units of work live in a queue/DB (on a non-spot instance)?
Attaching and detaching volumes is a good idea, but I wouldn't use it to keep state.
We use k8s at work. I just have to create a PVC, and when a spot instance is terminated along with its container, a new container is created and mounts the PVC again automatically.
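For reference, a rough sketch of creating such a PVC with the official Kubernetes Python client (the name and size are just examples; in practice you'd more likely declare this in a manifest):

    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    api = client.CoreV1Api()

    # Claim persistent storage; the cluster's provisioner backs it with a volume
    # that survives pod rescheduling when a spot node goes away.
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="gitlab-data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            resources=client.V1ResourceRequirements(requests={"storage": "20Gi"}),
        ),
    )
    api.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)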
It sounds wrong to try to keep state across two EC2 instances. If you find yourself in that situation, try a bit harder to push your state outside the EC2 instance (DynamoDB, S3, etc.).
You will get a lot of benefit out of it, though you may lose some performance, which is fine in 99% of cases.
1. http://www.orangefs.org
2. http://www.averesystems.com/products/products-overview
3. https://portworx.com
4. https://github.com/openebs/openebs