
When EC2 Hardware Changes Underneath You - usaar333
http://blog.picloud.com/2013/01/08/when-ec2-hardware-changes-underneath-you/
======
cperciva
I see no evidence here that EC2 hardware is "changing underneath" anyone;
rather, some EC2 instances are running on different hardware from other EC2
instances, and _PiCloud_ is moving the customer between different types of
hardware.

~~~
chewxy
An 'instance' is not permanently tied to one physical box. Over its lifetime an
'instance' can end up on different machines: if you stop your instance and
restart it, you may find it is now on another box with a different CPU. I think
this is the problem for PiCloud. One moment they shut down their instance and
the next, whee, no AVX

~~~
nicpottier
That is a very different thing from 'changing under you', which implies that a
running instance is being shuttled around to different underlying hardware.
Though I think that is possible, it isn't what is being described in TFA.

We don't even know that the case you are talking about (stopping, then
starting an instance) leads to it living on a different hardware type, though
I would suspect that is true.

TFA is saying that when you boot NEW instances you sometimes get subtly
different hardware, which when deploying binary code can lead to issues.

That does seem like a valid EC2 bug, but also not one that most of us will
ever run into unless we're doing cool massive dynamic scaling like they are.

~~~
vacri
I don't know if AWS does it, but I remember going to a VMware-sponsored
industry day several years ago, and they described that for enterprise
systems, the hardware was pooled and an abstraction layer sat between it and
the guests. Guests could be migrated between machines while they were running
and active. It allowed a system where in quiet times you could move your guest
load to only part of your hardware group and put the rest into low-power mode
(to save on power, it seems).

The best part of that day was a guy from the government who gave a great talk
on his experiences virtualising from running separate physical servers. While
he was very much in favour of it, he mentioned a few drawbacks that the vendor
talks obviously played down, but probably the best thing he said was "when
doing something major with infrastructure like this, get your boss at the top
on board early, because _everyone_ above you will try to scuttle it to CYA if
it goes wrong."

~~~
dbarlett
VMware calls it vMotion [1]. You define the host cluster and it handles the
rest. Xen supports live migration [2], but EC2 does not.

[1] [http://www.vmware.com/products/datacenter-
virtualization/vsp...](http://www.vmware.com/products/datacenter-
virtualization/vsphere/vmotion.html#glance)

[2] <http://sysadmin.wikia.com/wiki/Live_migration_xen>

~~~
spydum
FWIW, vMotion checks CPU flag compatibility before migrating. Pretty sure the
CPUs must all be of the same family to be clustered together for this purpose.

~~~
jimwalsh
VMware EVC simplifies the CPU requirements you mention. It can still be very
restrictive, or more lax, depending on your requirements.

[http://kb.vmware.com/selfservice/microsites/search.do?langua...](http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003212)

The original article just struck me as yet another company that doesn't know
how to properly run applications in virtualized environments. Yes, if your
hosting provider supports hot or cold migrations, you should be aware of that
and develop accordingly. I see this quite often: even today, people are
surprised that their VM is not always running on the same physical box at all
times.

~~~
xyzzy123
I think you're being overly harsh. Picloud is a super-awesome abstraction
layer for running Python over AWS.

The problem they have is that they are trying to support arbitrary Python
(which includes underlying libraries like LAPACK which depend on runtime CPU
detection working correctly) and expecting code will work the same across
identical instance types.

I think normal people would consider this a fair assumption, and if AWS's
advertised CPU capabilities weren't broken on some instances, it would hold.

I don't think it's fair to say that they're ignorant of how virtualisation
works. Disclaimer: I am a huge picloud fan and they have saved me a lot of
time.
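To make the failure mode concrete, here is a hypothetical sketch of the
runtime-dispatch pattern such libraries use (the function and kernel names are
invented for illustration, not PiCloud's or LAPACK's actual code): pick the
fastest kernel the CPU advertises, which ends in an illegal-instruction fault
when a feature is advertised but disabled by the hypervisor.

```python
# Hypothetical sketch of library-style runtime dispatch.
# Kernel names (dgemm_avx etc.) are invented for illustration.

def detect_features(cpu_flags):
    """cpu_flags: a set of feature-flag strings, e.g. from /proc/cpuinfo."""
    return {"avx": "avx" in cpu_flags, "sse2": "sse2" in cpu_flags}

def select_kernel(cpu_flags):
    """Pick the fastest matrix kernel the CPU claims to support."""
    feats = detect_features(cpu_flags)
    if feats["avx"]:
        # Fastest path; faults with SIGILL if AVX is advertised
        # but actually disabled underneath.
        return "dgemm_avx"
    if feats["sse2"]:
        return "dgemm_sse2"
    return "dgemm_generic"
```

The point is that the dispatch decision is made once, from advertised flags,
so identical code on "identical" instance types can behave differently.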

------
hayksaakian
Describe a problem many people might have. Write well. Slide your solution in
at the end.

I like it (no sarcasm). Good technique for writing a company blog.

~~~
dagw
Not only that, it introduced me to a company that had somehow completely
slipped under my radar and that I really, really wish I'd known about a few
months ago.

I will almost definitely be using them in the near future.

~~~
cschmidt
I'm a very happy PiCloud customer. They have "Environments", which let you log
into an instance and install whatever you want. You can then run your code on
that customized instance in the future. So your Python code can access any
weird compiled packages you want. You can also "publish" Python routines,
which can then be called through a RESTful interface. I use this to completely
decouple my website code from my computationally intensive code. (The Django
website just calls the published functions.) I've had good support on the few
issues I've had.

------
frozenport
The old version does AVX and the new version doesn't? That's crazy! AVX can
give a large speedup [1] in code that would otherwise not be vectorized. For
existing code it can be 20%.

The best strategy may be to work with EC2, or to reject the AVX non-compliant
instances.

[1]
[http://www.behardware.com/medias/photos_news/00/30/IMG003051...](http://www.behardware.com/medias/photos_news/00/30/IMG0030514.gif?iact=hc&vpx=757&vpy=114&dur=12085&hovh=177&hovw=284&tx=90&ty=88&sig=108728287951082340176&ei=QCntUMHgIcrY2QWfsoDABQ&page=1&tbnh=147&tbnw=229&start=0&ndsp=18&ved=1t:429,r:16,s:0,i:138)

~~~
masklinn
Neither version does AVX: the old instances because the hardware doesn't
support it, and the new ones because it's disabled in the hypervisor. But some
packages apparently don't fully check AVX support: they check that the
hardware is AVX-capable, but they don't check whether it has been disabled.
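A more complete check has to confirm both that the CPU has AVX and that the
OS/hypervisor has actually enabled it (via the OSXSAVE/XGETBV part of the
story). A rough Linux-side sketch, assuming /proc/cpuinfo is available and
using an invented function name:

```python
# Rough sketch: treat AVX as usable only if both "avx" and "osxsave"
# appear in the flags line. The authoritative check is CPUID (AVX and
# OSXSAVE bits) plus XGETBV to confirm YMM state is enabled.

def has_usable_avx(cpuinfo_text):
    """cpuinfo_text: the contents of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return "avx" in flags and "osxsave" in flags
    return False
```

Usage would be something like `has_usable_avx(open("/proc/cpuinfo").read())`
before choosing an AVX code path.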

------
contingencies
Yep. Cloud stuff is still immature.

While guarantees that would remove all such issues ("leave me on system 'x'
with no shifting sands forever!") are costly and therefore undesirable in a
cloud context, this describes one example of an edge case that could be
handled elegantly: inform the node that it must go down and come up again a
while later, at which point any issues caused by its swapped-out guts could be
resolved more easily.

I would generalize this into a broader statement: "Environment-related
guarantees need to be further specified by commercial cloud providers, and a
better interface given to clients when changes are scheduled".

Other areas of cloud APIs (particularly cross-cloud) that are presently
missing: legal jurisdiction, site/availability zone enumeration, available
hardware configuration enumeration (including network bandwidth policies),
resource guarantees (including network bandwidth).

I recently posted these as bugs to CIMI @ deltacloud's teambox -
<https://teambox.com/#!/projects/deltacloud/task_lists>

~~~
Goopplesoft
I wouldn't call this immaturity on the part of the cloud, especially
considering how unusual a case this is. Even at maturity, the cloud isn't
going to be a 100% perfect platform for every individual use case. These kinds
of checks/validations have to be handled by the client who relies on them, and
it seems like the cpuinfo check does just that perfectly well.

~~~
dchichkov
Would you run a service on EC2 that you plan to keep running over, say, the
next decade? And want it to run in set-up-and-forget mode?

~~~
kalleboo
Would you use a hammer to drive in a screw?

Edit: The promise of cloud hosting is to be highly dynamic and let you scale
up or down at a moment's notice. In order to achieve that, there are
tradeoffs. It's silly to live with the cons if you don't need the pros. Each
tool for its own job.

~~~
dchichkov
Have you tried maintaining some infrastructure over a decade? I have. And I
sincerely don't know which way is better.

There are tradeoffs. Physical hardware doesn't change underneath you, but it
can fail, requires expensive maintenance, and needs reliable infrastructure
around it (colo). Hosting providers tend to phase out services or even go out
of business completely. PaaS limits you, and providers change their APIs all
the time.

Having been an EC2 user since its announcement and first open beta, I'm
actually more and more inclined to think that it IS mature enough to be
considered. I'm pretty sure that if I tried, I could find an AMI from 2007 and
it would run perfectly well.

On the other hand, from a cost perspective, reserved instances are not that
expensive, and, unlike with regular hosting providers, costs are pretty much
guaranteed to go down in the long term.

------
viktorsr
I wrote about hardware change issue and a couple of related issues a while
ago: <http://www.rotanovs.com/cloud/amazon-ec2-failures/>

Also, to clarify: the hardware change occurs when you stop an instance (which
frees the hardware, so it can be taken by another customer) and then start a
new one using the same EBS volumes.

------
krmboya
Can distributed systems ever be fully transparent? They seem susceptible to
subtle bugs that make it hard for them to be so.

That said, with my undergraduate CS studies drawing to a close, I doubt I can
do such thorough debugging. Are there any useful guides/resources one can use
to understand and debug the various hardware architectures?

~~~
xyzzy123
Hi!

> I doubt I can do such thorough debugging. Are there any useful
> guides/resources one can use to understand and debug the various hardware
> architectures?

I'm not aware of any good asm or CPU arch books which cover avx more
accessibly than the Intel manuals but I imagine someone will correct me if a
good reference is available. I'd never heard of avx until today since those
instructions are new since the last time I had to look at x86 SIMD.

I would actually just recommend a general text like "Debugging: The 9
Indispensable Rules for Finding Even the Most Elusive Software and Hardware
Problems" as a first shot. I can't really answer your question in the way you
want :( There is no "magic book" which will explain everything you need to
know.

I'm an "intermediate level" debugger. I've spent perhaps a few hours a week
doing low-level debugging for the last couple of years on x86/x64 Windows and
Linux plus maybe 2 embedded archs. I can give some general advice as to what
might be a good use of your time if you want to get good at low-level
debugging. Obviously, anyone doing stuff like this as their career will need
to go a little bit further.

Mainly I am writing this because I think I can answer your question better
than saying "Read several thousand pages here:
[http://www.intel.com/content/www/us/en/processors/architectu...](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)."
It's not really a good use of time for most people. It is not an entirely
wasted investment in your early 20s, but 5 years down the track you won't
remember much of it unless you live in the debugger or work on a code
generator.

The way everyone I know does it is to have a good handle on the general case
and research (google :) everything else as needed. You can learn the basics in
about 40 hours. I know many people who are talented at reversing and
debugging, and I can tell you that they do not have (or need) detailed mental
models of the SIMD extensions or (speaking for myself) even the FPU. It's cool
to know, but with the exception of domain-specific work (codecs, fast math) it
is not necessary to keep that information "swapped in". When you are working
low-level you will find there is an enormous amount of detail to the world.
That's the whole problem, and I can explain this with an analogy: imagine
trying to diagnose a cracked engine block with an electron microscope.

The general idea is to step back, know the basics, and be prepared to apply
detailed analysis as necessary.

Priorities:

    
    
      0. Determination, patience and "can do". Common sense and knowledge of general debugging strategies.
      1. How to operate the debugger
      2. Top 50 mnemonics and references on hand for the rest
      3. Calling conventions and ABI (http://en.wikipedia.org/wiki/X86_calling_conventions)
      4. Basic OS internals (location of key data structures, heap layout, syscall conventions)
    
      ... (everything else)
    

0\. The number one predictor of success is to have a kind of "I can do this"
attitude even though you might not necessarily know what you are doing. The
confidence that you can figure it out and the willingness to spend the time to
do it. You won't always be right, but you won't get far without it. You also
need the general principles of divide and conquer, basic logic and how not to
fool yourself.

1\. Knowing your way around the debugger really well is more useful than
knowing reams about, say, the specifics of CPU architecture. So, how to set up
symbols and sources, inspect the state of your process/threads. Most crashes
can be resolved with a backtrace and looking at a few locals (assuming you
have source).

2\. If you need to read asm, you only need to know the top 50 or 100 mnemonics
(if that). If you look at instruction frequencies you'll find the top 50
mnemonics make up more than 95% of all code by frequency, so you can work
quickly knowing just these and look up the remainder as required. I had a
reference for that (which I can't find) but a quick-and-dirty analysis (I did
this on x64 Ubuntu 10.04) goes:

    
    
      $ find /usr/bin -type f | xargs file | grep ELF | cut -d: -f1 | xargs -n1 objdump --no-show-raw-insn -d > /var/tmp/allops.lst   # dump all instructions from binaries in /usr/bin/ to /var/tmp
      $ egrep -i '  [0-9a-f]+:' /var/tmp/allops.lst | awk '{ print $2 }' | sort | uniq -c | sort -rn > /tmp/opfreq.list               # get sorted list of mnemonic frequency (highest at the top)
      $ head -100 /tmp/opfreq.list | awk '{sum += $0} END {print sum}'                                                                # accumulate frequency of top 100 mnemonics
      30229337     
      $  awk '{sum += $0} END {print sum}' /tmp/opfreq.list                                                                           # accumulate frequency of all mnemonics
      30356097
        

Top 50 is 97%. Top 100 is 99.6%. If you do a more granular analysis (involving
addressing forms etc) you'll find a similar conclusion holds.

3\. The ABI comes next; basically because you're not going to be able to make
sense of function prolog/epilog or the state of your stack without it.

4\. Knowing your memory map and OS specifics really help too (so e.g. on
Windows how to read the PEB/TIB, syscall convention for your OS, roughly how
the heap is laid out, whether a pointer is pointing towards a local, a heap
address or a library function). Again, only to a high level really.

\---

The normal way to debug something like this (after googling your error, of
course) would be to repro the crash, check the call stack, look at the source
code for the library, and figure out what path takes you to where you crashed.
In this case you would work out reasonably quickly that the crashing eip is in
some AVX-optimised LAPACK code and that LAPACK chooses this code at runtime
based on the advertised CPU capabilities. Then you would be confused for a
bit. Eventually you would figure out you're faulting because AVX instructions
don't work, but you only reach them because they're advertised. Hence Amazon's
bug. The whole process is pretty slow, but it's the standard and obvious way
of doing it.

However in this case the problem they had _really_ was that the crash was
intermittent.

Based on the narrative given, the picloud guys took a more "cloud-like"
approach to diagnosing the issue: they ran the "unreliable" code (plus some
environment scraping, I'm guessing) across a whole bunch of instances and
worked out by google and eyeball what was different about the crashy
instances. This is a practical way of doing it :) It's almost a kind of
"statistical debugging", if you want to put things into buckets. Most major
software vendors now get minidumps when their apps crash, and this
(statistical debugging) is actually an interesting field of study in its own
right. It could use some more postgrad attention. See e.g.
[https://crash-stats.mozilla.com/topcrasher/byversion/Firefox...](https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/18.0)
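The bucket-and-diff idea can be sketched in a few lines (names invented; real
per-instance facts might be cpuinfo flags, CPU model strings, or kernel
versions): collect facts from healthy and crashy instances, then look for
facts common to every crashy instance but absent from all healthy ones.

```python
# Sketch of "statistical debugging" by environment diffing.
# healthy / crashy: lists of per-instance fact sets.

def suspicious_differences(healthy, crashy):
    """Return facts present on every crashy instance but on no
    healthy one -- the first candidates for the root cause."""
    common_crashy = set.intersection(*map(set, crashy))
    seen_healthy = set().union(*map(set, healthy))
    return common_crashy - seen_healthy
```

Feeding it cpuinfo flag sets scraped from a fleet would, in a case like this
one, surface whatever is unique to the crashing instances.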

\---

I'm going to finish this off by explaining what would be better than reading
books and references: finding excuses to do it. It turns out that debugging is
mostly thankless and only buys you credit in a very limited social circle.
Truthfully it's not a good use of your life unless you're the kind of person
who enjoys it. Think of it like chess or go problems. If you want to be good
at it, you have to find an excuse. Some motivating activities people find for
doing low-level work (in no particular order) are:

    
    
      1) Cracking commercial software, writing game trainers, hacking online games (Download trials, or your game of choice)
      2) Writing exploits (Say, check CVEs, figure out if you can repro, debug until your eyes bleed, write an exploit) 
      3) Improving open source software (find a bug tracker, repro crashes, isolate the bugs)
      4) Doing crackmes (see e.g. http://crackmes.de/)
      5) Commercial reasons (work on a toolchain, compiler, embedded system ports, your $software)
    

\---

P.S: your complete problem solving breakfast should include repro first,
understanding your target, a bit of reasoning and guessing, dynamic analysis
(tracing first: Process Monitor/strace/ltrace, debuggers:
gdb/ddd/WinDbg/Immunity/OllyDbg, instrumentation: dynamorio/pin), static
analysis (objdump/IDA pro) and copious amounts of whatever will make your life
easier.

\---

If you are interested I can tell you some war stories about debugging problems
in distributed systems but this post is already too long.

~~~
krmboya
Thanks! Especially for the part of finding an excuse for doing this stuff.
Many times I start learning something, only to find I have zero motivation to
continue.

------
lkrubner
This is a true story, which happened to me last month:

I am at work. I log into an EC2 instance via ssh. I establish a screen
session. I do some work inside of screen. Go home after work, leaving screen
running.

I arrive at work the next day. Log into EC2. I type "screen -ls" and I am told
that there are no screen sockets. (In my experience, this usually means the
server has been restarted.) I am annoyed. I create a new screen session and
proceed to get some work done. That evening, I leave the screen session
running, and head home.

I arrive the next day at work. I log into the server. I type "screen -ls". I
am again told that there are no screen sockets. I am now very annoyed. I start
a new screen session and proceed to get some work done. That evening, as
before, I leave the screen session running, and I head home.

I arrive the next day at work. Once again, I log into the EC2 instance via
ssh. Once again I type "screen -ls". Once again I am told that there are no
screen sessions.

This happened 4 days in a row.

I was left feeling angry, and feeling like no EC2 instance could be trusted. I
also feel it damages my productivity that I cannot rely on screen (in the
past, on regular Linux servers, I have had screen sessions that lasted for
many months).

Right now I have all of my personal sites on the Rackspace cloud, which I
think was taken over from Slicehost. Although this is called a "cloud"
service, the "slices" feel like real computers to me -- I can have a screen
session that lasts for months.

The EC2 instances are strangely insubstantial, even when compared to other
services that promote themselves as cloud services. Personally, I prefer to
work with services that are at least solid enough that I can rely on screen
sessions.

~~~
nicpottier
I'm confused, what exactly do you think is happening? Obviously individual EC2
instances can run for years without being rebooted or having processes die, or
nobody would use them. (I've run services on EC2 since they first launched and
have never had such issues.)

What's your theory on why your screen instances are dying and how would EC2 be
responsible for it?

~~~
lkrubner
I really do not know. I have not had the time to investigate how and why this
particular service might suffer so much on EC2. I do not know if our EC2
services were suffering from something unique to us, or whether this is a
general problem with EC2. I do know that I was annoyed as hell. And I know I
have not had this problem with other cloud services, such as the one offered
by Rackspace.

~~~
mh-
You never considered checking the uptime, or dmesg?

~~~
alexkus
Also worth checking:

The contents of /var/run/screen/ and the "S-<username>" subdir that should be
there.

The output of "ps u" and "ps -ef | grep s[c]reen".

