IIRC, we launched interactive serial port access sometime in late 2014. For example, mbrukman answered an SO question on Jan 2, 2015 with connect-to-serial-port. I don’t recall when we gained fancier IAM controls for it, but we’ve had it forever (and I think getting / view only was there at public launch).
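(For anyone looking for it now, the flow is roughly the following; the instance and zone names are made up, and you only need the metadata step if serial-port-enable isn't already set:)

# enable serial port access on the instance, then open the interactive console
gcloud compute instances add-metadata my-instance --zone us-central1-a --metadata serial-port-enable=TRUE
gcloud compute connect-to-serial-port my-instance --zone us-central1-a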
Edit: and really the person who did the most work isn’t mentioned here (that’d be up to them)
This is a long time coming.
A whole lot of questionable prior decisions had been made that did not help.
You learn a lot in situations like this about being helpful with the customer without being judgey. Getting them back on their feet with a smile when they thought they were screwed keeps a lot of contracts around.
I like this paragraph. It says a lot about what divides the long-term contractors with a full pipeline from those without one. It would be a neat topic to blog about if you ever have the time.
One other option is to exploit a bug in managed software to escape to a shell. One man's CVE or backdoor is another support engineer's magic sword to save the day.
Other implementations I've seen drop you right into a root shell, relying on equivalents to IAM to govern access to the other side of the virtual serial port rather than machine local permissions.
(Though if some downtime is allowed, then it's probably possible to get into single-user mode, then manually start the relevant daemon and its dependencies, basically doing whatever init would normally do; I've done this in my homelab on Slackware, but it ain't something I'd be excited to do in production, and systemd probably complicates things further)
Also, not sure about other distros, but I recall that Ubuntu normally requires a root password even for single user mode. You might be better off using a boot disk and chrooting your way in.
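(For reference, the boot-disk route is roughly this; the device and mount point here are made up and will differ depending on your image and rescue environment:)

# from a rescue/live boot: mount the broken root, bind the virtual filesystems, chroot in
mount /dev/sda1 /mnt
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
chroot /mnt /bin/bash
# fix the root password / fstab / whatever broke, then exit, unmount, and reboot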
A simple "rely on IAM for access control" doesn't match with my idea of defense in depth.
"Your business might not exist if our engineers didn't dedicate themselves to your problem even though we didn't need to according to our SLA" goes a real long way at reup negotiation time with the right sized business (not too big, not too small).
Also serial console is pretty handy when you push bad firewall rules and don't want to throw away your instance.
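(Being able to just undo the damage in place is the whole point; a minimal sketch, assuming plain iptables:)

# reopen the default policies and flush the bad rules
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F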
I'm not an expert in hypervisors or anything like that, so I'm wondering what was stopping them from adding it in the past?
Pretty much all hypervisors support serial consoles, but usually those interfaces are limited to trusted admins. For something like AWS, they'll also have to connect it from the hypervisor hosts into their public UI, and they can't trust the users.
If an instance wedges itself into a state where I need console access, I'd just kill it and provision a replacement (ideally, my monitoring and automation will have done that already and not even have woken me up to tell me).
I'm not sure I'd be at all comfortable having irreplaceable single points of failure in AWS. (Though I do recognise that people use it that way all the time...)
What it is actually saying is not that it's a good approach to just throw away servers when they start having problems. What's important is having that ability when it is necessary.
Servers don't just randomly fail. If your instance goes down, sure, your automation will recover your system, but it will still be important to know why it failed, because it may be a symptom of a deeper issue.
It can be due to a hardware issue, but even when that happens, you don't throw away good hardware; you fix it and the server can return to full operation. In the cloud, though, you have no idea what the hardware is doing. Maybe the instance failed because the underlying host failed; maybe it didn't. You should still find out.
Even at not-very-high scale, AWS instances _do_ "just randomly fail", at least for all practical interpretations. I don't run anything like FAANG scale, only hundreds of instances rather than thousands or millions, and I see at least a few "random failures" a year (not including spot instances terminating, which I see in clumps every month or so).
I (almost) never try to repair a broken EC2 instance. Wherever I can, they'll be running totally stateless, and I just provision new ones and kill off old ones. I probably won't even bother investigating if it's a rare and singular problem on a known-reliable platform. If one instance wedges and gets replaced, I'll just have a note to investigate if it happens again any time soon. If we get a second failure, we'll go looking in logs and maybe keep and investigate the EBS volume.
For platforms running new-ish code, procedures are different. If we see dead instances after deployments, we obviously investigate the new code/config there. But a fair chunk of clients where I am only get 6 or 12 (or even 24) month backend update cycles; if I've got dozens of instances running the same code for months on end and _one_ dies, we just bury it and replace it, and keep a closer eye on the rest of the "herd" for a week or two.
When you manage your own physical servers, you have more knowledge of your risk. The actual time of failure will still be random, but if you've been running a host for 5 years straight, you know the risk is growing.
But we're mostly agreeing here. In the scenario where you throw away a "randomly failed" instance, the historical stability is good evidence that it is due to a hardware failure, and you can just replace the instance and move on.
Anything more esoteric than "a normal Ubuntu" can have a bug, e.g. hanging when it has three network interfaces, or similar.
They've gotten so far without this functionality that I have to wonder what finally tipped the balance into their offering it.
s/vendor is shipping AMIs without sshd, and they/TLA/
I wouldn't assume that VM migration does not exist in AWS. The overall design and implementation of Google's infrastructure somewhat mandated the development of live migration support from day one. AWS was designed and built differently, and some types of events that force live migration in GCE do not exist in AWS.
One specific example from Google's VM Live Migration At Scale paper is "Regular maintenance on the power infrastructure in our data centers requires powering down subsets of machines for extended periods of time". The power infrastructure at AWS is designed to be redundant and concurrently maintainable, which removes a significant need for workload mobility within the datacenter.
Personally, I think it was a very good idea to turn the thing that had to be built to launch into a marketed, differentiating feature. But that doesn't mean that AWS doesn't have an ability to live migrate some workloads if it is able to do so without disrupting customers, or if it delivers a better experience than alternatives (e.g., instance degrade notices).
VM Migration is only for maintenance on GCP -- and customers can't control it, just Google.
AWS can hot patch live systems in place without any downtime, so that's better than a migration (which has a brownout / maintenance period).
AWS's non-bare-metal systems can boot in ~10s with enough tuning.
Their bare-metal systems take tens of minutes to boot.
Nested virtualization would allow scaling up and starting new nodes much faster.
Even when you offer bare metal, it’s actually still nice to have nested virt! Otherwise, every node has to be a full sized host. So when you have a K8s cluster or similar with a pile of nodes and want to allow some teams to use it (e.g., Android emulator, firecracker, whatever), it’s really nice not to have to say “okay, this group requires full bare metal hosts that they manage themselves”.
tl;dr: nested virt is still a nice to have so that all your infrastructure looks the same.
Edit: Also, you can trigger migration yourself if you want (gcloud compute instances simulate-maintenance-event), but that's mostly to convince yourself that nothing bad will happen.
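(Something like this, with your own instance and zone substituted in:)

# ask GCE to live-migrate this instance as if maintenance were scheduled
gcloud compute instances simulate-maintenance-event my-instance --zone us-central1-a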
A few people were noting that SimpleDB has been deprecated: it's not listed in the AWS web console and you can't find docs for it anymore, but if you have a running instance, your service API calls still work. And I think there have been many deprecation warnings since, plus migration messages. But they don't want to break existing clients.
I'm guessing this is a similar case where they want to be really, really sure that it's worth offering the service.
Here are the docs: https://aws.amazon.com/simpledb/
And adding to the fun, watching it on an initial instance bootup seems to block the process AWS uses to grab the encrypted password out of the log. So, it's not useful, and makes the instance a bit hard to remote into.
It does require connecting to the working instance first and setting things up, which is less than ideal, but I get why it's the case. I guess I'm off to configure the Windows instances I care about, so I have a way to troubleshoot things in the future.
For general consumers, not much value IMHO.
I spent a lot of time looking at console screenshots of machines that would not boot and iterating to figure out the problem.
That’s a good default posture. What sucks is when you’re trying to debug a system that has OOM-killed sshd and then is behaving generally poorly. If you replace your instance with another one, you just get another OOM kill.
At this point, without interactive serial port access, you get to replace whatever you’ve got on the box with more logging statements. That’s a totally reasonable approach, but with interactive serial ports you can poke at it and root cause a lot faster.
Edit: Also, Linux seems to always kill sshd first. (Part of this is survivorship bias, of course).
Amusingly, this finally forced me to find bugs like this one:
(All processes started under a remote shell get adjustment -1000, which is basically "never shoot me").
There are a few related to setting up the sshd adjustment itself as well.
So, looks like a config problem! Thanks for pointing this out.
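(For anyone else who hits this, the systemd-flavoured fix is roughly a drop-in like the one below; the unit is ssh.service on Debian/Ubuntu and sshd.service elsewhere, so adjust the paths to match:)

# pin sshd's OOM score via a systemd drop-in, then reload and restart the unit
mkdir -p /etc/systemd/system/ssh.service.d
cat > /etc/systemd/system/ssh.service.d/oom.conf <<'EOF'
[Service]
OOMScoreAdjust=-1000
EOF
systemctl daemon-reload && systemctl restart ssh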
Regarding SSH, if you enable sshd debug logging you can see that sshd sets its own score to the minimum possible, which is why your comment about sshd being targeted still doesn't make sense to me. I actually didn't know it was sshd doing this on its own till I ran this:
server ~ # grep oom_score_adj /usr/sbin/sshd
grep: /usr/sbin/sshd: binary file matches
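(And checking the live process agrees; pgrep flags per your distro's procps:)

# the listening daemon should report -1000; per-session children can differ
cat /proc/$(pgrep -o -x sshd)/oom_score_adj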
Oct 13 23:20:38 server kernel: Mem-Info:
Oct 13 23:20:38 server kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Oct 13 23:20:38 server kernel: [ 2455] 60 2455 1778653 273114 710 10 0 0 mysqld
Oct 13 23:20:38 server kernel: [12085] 207 12085 22574 673 32 3 0 0 tlsmgr
Oct 13 23:20:38 server kernel: [ 4238] 0 4238 9234 518 20 3 0 -1000 systemd-udevd
Oct 13 23:20:38 server kernel: [12278] 0 12278 88107 5597 136 4 0 0 apache2
Oct 13 23:20:38 server kernel: [17222] 0 17222 1258983 142035 505 8 0 0 qemu-system-x86
Oct 13 23:20:38 server kernel: [21069] 0 21069 5033 487 14 4 0 0 bash
Oct 13 23:20:38 server kernel: [15935] 0 15935 7081 487 16 3 0 -1000 sshd
However, I still don't understand your comment about the Linux OOM killer wanting to kill sshd "first" (or _ever_, based on these renewed findings!). Can you elaborate?
There's also any case where you're debugging AMI builds and need to fix grub or the init system without waiting 20 minutes for a new AMI build each time.
Also, the existing console log feature in AWS is insultingly not real time. It doesn't typically update at all unless you're within minutes of boot or trigger a reboot, and it only buffers something like 4kb, so a reboot can easily fully replace the logs. This really sucks when you're trying to get the debug console output, so this feature finally solves that.
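(For completeness, the CLI side of that is get-console-output; if I have the flags right, --latest on Nitro instances asks for the most recent output rather than the post-boot snapshot. The instance ID is made up:)

aws ec2 get-console-output --instance-id i-0123456789abcdef0 --latest --output text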
Also, why would an SSH session, which is entirely in memory, time out because of I/O thrashing? You can disconnect the hard drive that sshd and/or the OS is running from and your SSH connections to that machine won't break. If you run some commands that aren't cached in memory you'll naturally get critical I/O errors, but it won't cause a disconnect on the SSH layer.
SSH is purely in memory; however, in order to allocate memory for it, Linux will pull "free" memory out of whatever heavily fragmented corners it can find it in. And it may even need to perform disk I/O to free memory that was tied up in various disk caches.
People refer to this as a "livelock", where Linux is going crazy doing lots of stuff but from userspace the system is completely frozen.
Facebook developed OOMD, a userspace oom killer to deal with this issue, their release blog post references the 30 minute livelocks they face: https://engineering.fb.com/2018/07/19/production-engineering...
They've actually gone so far as to submit kernel patches for the newer PSI (pressure stall information) interfaces, which they use in oomd to better detect stalls due to this thrashing.
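(On a kernel new enough to have PSI, you can read the same counters directly; e.g.:)

# "some" = share of time at least one task stalled on memory; "full" = all non-idle tasks stalled
cat /proc/pressure/memory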
I suspect running low on memory can trigger symptoms that look like sshd failing.
sshd gets paged out (or something else you need for a successful login). Un-paging becomes incredibly slow, as there's lots of IO going on from all the paging. Anything garbage-collected starts running GC constantly, using 100% CPU.
Then your attempt to SSH times out - and with no access to list running processes, one naturally concludes sshd has failed.
In fact, I would guess (especially given all this investigation!) that it’s much more likely that an inaccessible box is just under too much memory pressure for sshd to respond.
Amusingly, the answer is still the same: serial port! :).
Thanks again for all the pointers (to everyone in this thread).
So maybe the parent post was remembering a time before then.
Personally, I'm in the first camp. I'm used to taking instances for granted: on the rare occasion that a low-level issue arises, I just promote a replica or trash the instance if it's stateless.
I'm assuming the use case where you care about fixing the type of issues this feature helps debug is fairly esoteric?
I remember having an EC2 terminal in the browser years ago and recently I went back and it seemed far more locked down.
- usually these are running with very few dependencies in the userspace stack, such as a getty directly spawned by init (or the modern equivalent arrangement). This means that it's accessible even if your networking stack is not working, or if you screw up your firewall config. You can even make sure your init / getty / bash are statically linked so that not even ld.so breakage will stop you.
- it looks like they're also enabling Linux's Magic SysRq features, which give you some very raw hooks into the kernel itself.
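(A rough sketch of how that usually works over a serial console: make sure SysRq is enabled on the guest, then send a serial BREAK followed by the command character:)

# enable all SysRq functions (or use a bitmask to allow only some)
sysctl -w kernel.sysrq=1
# then from the console client: send BREAK, then e.g. 'h' for help,
# 'f' to invoke the OOM killer, or 'b' to reboot immediately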
I once implemented a tool called virt-dmesg which read out the log_buf from a running Linux kernel (surprisingly useful for those tricky crashes, but difficult from a maintenance point of view, so the tool is now abandoned). I suppose that's the closest you could get to a "real" console at the hypervisor level.
Based on a quick SSH it looks like it's a serial thing:
root 1221 /sbin/agetty -o -p -- \u --keep-baud 115200,38400,9600 ttyS0 vt220
root 1229 /sbin/agetty -o -p -- \u --noclear tty1 linux
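(And for anyone wanting the same on their own images, the modern-equivalent setup is roughly: enable the serial getty and put the console on the kernel command line; the exact grub file location varies by distro:)

systemctl enable --now serial-getty@ttyS0.service
# plus, in GRUB_CMDLINE_LINUX (e.g. /etc/default/grub), something like:
#   console=tty0 console=ttyS0,115200n8
# then regenerate the grub config and reboot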
Nitro is when you get the whole bare metal server and you need to run your own hypervisor/OS (which is why they mentioned VMware). This instance type hasn't been available to the public very long (like a year or two). Maybe I am missing something here, but I think a lot of comments seem to be misunderstanding what this is.
As fguerraz mentioned, modern AWS instance families are basically all powered by Nitro, which refers to the ecosystem around the hypervisor and hardware acceleration cards utilized. https://aws.amazon.com/ec2/nitro/
I would say they fulfill different purposes. The SSM agent has quite a bit of additional functionality, even within the Session Manager portion. It's more of your solution for online, general day to day access.
Serial console will let you fix issues when you have lost the ability to boot an instance, or network connectivity has failed. When SSH or Session Manager are available, I personally would opt to utilize them over the serial console. But if I have an instance that I can't reach via those, am unable to replace it for whatever reason, and need to bring it back online, serial console would be what I would reach for.
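(If I have the workflow right, Session Manager is a one-liner, while the serial console needs a one-time public key push before you SSH to the serial endpoint; the instance ID and region below are made up:)

# day-to-day interactive access via SSM
aws ssm start-session --target i-0123456789abcdef0

# serial console: push a temporary public key, then SSH to the serial console endpoint
aws ec2-instance-connect send-serial-console-ssh-public-key \
    --instance-id i-0123456789abcdef0 --serial-port 0 \
    --ssh-public-key file://$HOME/.ssh/id_rsa.pub
ssh i-0123456789abcdef0.port0@serial-console.ec2-instance-connect.us-east-1.aws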
I made that up, but it would totally not surprise anyone, would it?