Your job descriptions list a lot of technologies but don't mention anything to do with operating systems or networking. It seems odd to explicitly call out Node and Redis but nothing further down the stack.
It's a good point, but right now we don't do anything that relies on one specific technology at the very low levels (for instance, kernel modules or very specific networking optimizations, which we don't need and which would be over-optimization for us).
Instead we try to maintain an SOA that is agnostic to the underlying tools backing our services, something we've found Docker useful for.
That being said, we generally just use Ubuntu, and have some other AWS-specific setups around networking. If you like hacking on Cisco routers, I think it's perhaps not the right position :)
I don't think there's anything wrong with that. Anyone in a sysadmin or devops role is going to be expected to be proficient with Linux and Bash; Windows and PowerShell for Windows desktop/server admins; Cisco and Juniper for networking; etc. These are the kind of default assumptions for positions like these.
More modern, highly specific technologies like HAProxy, Salt, Node, and Redis are not going to be known by every experienced, competent applicant, though; and while you can often pick them up on the job if you have general experience in the area, there's always going to be a learning curve.
I interviewed for a DevOps role at Clever a few months ago and the interview was 100% Python algorithm questions. I solved the problem correctly and was asked how to make it faster, so I did, and then was asked again how to make it faster. After 30 minutes of the Skype interview I told the guy I wasn't interested and thanked him for his time.
Let this be a lesson: teach your children about hubs before you teach them about switches! :) So then, that inevitable question will come up: "Well, how does it know WHICH port?"...
It is true that I haven't seen a hub in about 5 years, which means (100% anecdotally) that if someone is just graduating college now and started learning exactly 4 years ago, they might never have seen a hub...
I have a treasured and jealously guarded hub that I keep at work for network debugging stuff, when a port mirroring switch just won't do.
Almost impossible to get hold of now.
Same here... I've been seeing things like this every day, for years (Network/NetSec Support Engineer here :))
Just today: servers with a static IP address getting it changed to an APIPA address just because some third-party device sent them a unicast ARP request that should've been a broadcast per RFC 4436. I personally find ARP to be one of the funniest protocols out there, and I love the faces people make when they understand how ARP glues L2 and L3 together and suddenly everything makes sense.
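If you ever want to show someone that glue in action, two commands are enough (assuming a Linux box and an interface named eth0; adjust as needed):
tcpdump -n -i eth0 arp    # watch the who-has / is-at exchange on the wire
ip neigh show dev eth0    # then look at the resulting IP-to-MAC mappings and their states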
It is funny to see articles like this, because the network seems to be the "last frontier" for IT companies/workers. Only a handful seem to be brave enough to work on it (and actually like it). It is supposed to work like the lights when we switch them on and any interaction with a network engineer is just to tell them about an issue :)
In my opinion, the network guys are the geeks among the geeks. Don't get me wrong, it is easy to understand networking, it just takes dedication.
The "it works after traceroute" line was the moment I was sure it was ARP caching. Then I was just wondering if it had anything to do with AWS specifically, which could have pointed towards a serious VPC bug. That doesn't seem to be the case though. You're not weird. Just burnt by real experience.
I guessed this as soon as he mentioned spinning up/down a lot of machines using scripting. I've seen the same thing happen within an OpenStack environment I used to maintain.
As part of the cloud-init scripts on new machines we would install arping and have the newly provisioned VM automatically send gratuitous ARPs so that ARP caches would get updated.
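The guest-side part is tiny; roughly something like this (assuming the iputils arping and eth0 as the interface; the exact package name and flags vary by distro):
# announce our own address a few times with unsolicited (gratuitous) ARPs
arping -c 3 -U -I eth0 "$(hostname -I | awk '{print $1}')"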
I haven't seen the issue on my RHEL 7 VMs though, nor in my current environment, but we don't spin up/down as many VMs as we used to.
I've reached the point where flushing the ARP cache is part of my network troubleshooting routine. Stale entries always give you the weirdest, hardest-to-trace errors.
Until that time when you flush the cache on Solaris[1] and remove the entries for its own interfaces. Then it stops responding to ARP requests for itself. Now you have an even more fun issue to troubleshoot!
1. something silly like: "arp -a | awk '{print $2}' | xargs -n 1 arp -d". It's been a while and I don't think I was the original culprit.
And how do you fix that? I'm no Solaris guru, obviously, but this looks like a chicken-and-egg problem. How do you actually go about fixing it? It sounds very... Kafkaesque.
Actually, if you think about it, you can reach the server even if it's not replying to ARPs, as long as it's initiating connections to other devices on the same network. Say A is the server in question (not responding to ARPs) and it initiates a connection to B. In this case B will learn A's MAC, and so B will be able to reach A until A's entry times out of B's ARP cache. The router for that network also has A's MAC, but you may not have access to the network gear.
The intermittent connectivity to server A is what makes this problem so fun to diagnose. :-)
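And if you do still know A's MAC (from the router, or from a capture), another way out is to pin a static entry on B so it never expires; a sketch with iproute2, using placeholder values:
# permanent neighbor entry for A on B (address and MAC are made up)
ip neigh replace 10.0.0.5 lladdr 52:54:00:12:34:56 dev eth0 nud permanent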
Don't feel bad; as soon as I read the bit about them using a small amount of IP space, and new instances getting assigned IPs that were recently used by terminated hosts, I thought the same thing.
Interesting trace down to stale ARP entries. It gets worse when the switches are running MAC address filtering and the filter entries get out of date. We had that issue with some Blade G8052 top-of-rack switches and their upstream 10G ports. They sometimes "forget" which upstream port has the MAC address they are switching to, and those packets just spew out messily into the data center, leaving a mess. The "fix" is to force the switch to periodically ping up through a specific upstream port to the central switch's management IP address. Sigh.
I've seen the exact same issue in some Dell PowerConnect switches.
We only saw the issue when using the "iSCSI Auto-Config" mode. Manually configuring the switches with the same config but entered by hand resolved the issue.
This reminds me of a time at a previous company years ago, where we experienced an issue that felt similar, although the root cause was quite different.
Basically, we had multiple teams all launching/terminating web servers. Unfortunately, they were all in the same EC2 deployment, and more often than not our load balancers from one team would send traffic to the web servers of another team. Furthermore, our setups were similar enough that this would sometimes cause bad results for users. We fixed it by making sure that our web servers on every team spoke on different ports. Not elegant, but effective (until two teams accidentally picked the same ports).
These days we have good enough infrastructure tools that this problem should never happen. But in 2009, at a company that was overwhelmed with growth, those sorts of things happened.
Try an arping from the new workers on first startup? Ran into this quite a bit when using VIPs for DB failover, and an arping fixed the caching issue in most cases.
I think the biggest thing that surprised me in this investigation was the lack of documentation about link-layer tools. At one point I was looking through the source of the 'ip' command to try to find out exactly which conditions caused a 'STALE' entry in the ARP table...
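For what it's worth, the states themselves (and the timer that flips REACHABLE to STALE) are at least visible without reading source, on Linux anyway; eth0 below is just an example interface:
ip -s neigh show dev eth0                              # entries print REACHABLE / STALE / DELAY plus usage counters
sysctl net.ipv4.neigh.default.base_reachable_time_ms   # roughly how long an entry stays REACHABLE once confirmed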
This particular problem would not crop up on Google Compute Engine. See this cryptic snippet from /sbin/ifconfig on a production GCE VM:
eth0 Link encap:Ethernet HWaddr 42:01:0a:f0:40:dd
inet addr:10.240.64.221 Bcast:10.240.64.221 Mask:255.255.255.255
This isn't an intentional feature; it's just a property of how our Andromeda SDN[0] is wired up. In particular, we lift the business of figuring out where the other hosts on your LAN are up into the SDN, rather than relying on the guest to cache ARP entries (hence the /32 netmask).
Actually, this is intentional. Pre-Andromeda, we hit ARP caching issues when testing GCE at scale. The model of letting the SDN figure out the other end (and stub out things like ARP) was carried forward into Andromeda.
Oh, interesting. I didn't know it was actually in response to ARP issues. By the time I showed up it was just how bits were wired. Apparently I should've gone to the source :)
I think you are definitely right that IPv6 would mitigate this, but I think the real way to avoid the issue is for the kernel to make the tradeoff for you: an ARP ping is so cheap that you might as well do one on maybe every 100th packet or so. Or, similarly, time out and clear the ARP entries more frequently.
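On Linux you can already approximate that second idea with the existing neighbor-table knobs; the sysctls are real, but the values below are only an illustration, not a recommendation:
# age entries out of REACHABLE sooner, and let the GC look at stale ones more often
sysctl -w net.ipv4.neigh.default.base_reachable_time_ms=15000
sysctl -w net.ipv4.neigh.default.gc_stale_time=30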
As for other cloud providers, I'd be really interested to hear from Digital Ocean, Rackspace, and Google!
That would mean that one of two things would need to occur:
1) Amazon sends GARP on behalf of the guest OS, from outside the kernel
2) Amazon has hooks into the guest OS to instruct it to send the GARP
Both are very bad things with regard to customer / Amazon separation of privileges and control.
I think every network engineer was probably yelling "ARP!" at the screen early on in this write-up; it's likely one of the first things anyone seasoned in L2/L3 would look at.
Also, as others have stated, IPv6 removes this problem altogether. Broadcast is gutted completely and replaced by announcements within multicast groups, which are much more efficient when you're talking about "small" prefixes that house 18,446,744,073,709,551,616 hosts (a /64); ARP tables wouldn't be able to scale or work efficiently at that size.
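To make that concrete: a neighbor solicitation for, say, 2001:db8::abcd:1234 goes to the solicited-node multicast group formed from the last 24 bits of the address, ff02::1:ffcd:1234, so only the handful of hosts whose addresses happen to end in cd:1234 ever process it, no matter how many hosts the /64 could hold. (The addresses here are just documentation-prefix examples.)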
Overall this makes me think it would be interesting to build a best-practices guide for IPv4 and IPv6 networking within cloud provider environments. I think sending a gratuitous ARP on boot is a relatively safe practice, especially in environments that are in continual flux.
We had a fascinating bug on EC2 -- we could connect to the instance, but no network traffic made it out. It wasn't a security group problem; it was literally a really weird bug in EC2's network that we somehow triggered. The engineer over at Amazon who looked at it was really excited when he came across our case, as it was so weird, heh. They fixed it; I can't remember exactly what was done on their end, but it was one of the weirder problems I've attempted to debug. Nothing I tried worked!
Wouldn't it be a better solution to not reuse IP addresses quickly? If I understood it correctly, they are in a private network anyway, so they could afford it.
I thought there was a mechanism that would allow some sort of failure-notification packet to make it back to the origin, so that it would then perform a proper ARP lookup: "WHO HAS 1.2.3.4? TELL 4.3.2.1!"
The issue is that if it's on-network, the only real "failure" notification is a timeout. No ICMP packets, nothing, are sent back to indicate you are sending to a non-existent IP address.
I just experienced this earlier this week. Very frustrating. I also posted to the AWS forums and got zero assistance; I'm currently not paying for an AWS support plan. This article came at an opportune moment -- it makes sense and removes the shroud of mystery around why it "works sometimes", which had left me with an uneasy feeling for a production setup.
It seems to work fine everywhere except in EC2 VPC, where the ARP cache can sometimes become stale. We too reported this issue to AWS support, but have no idea if they are doing anything about it.
The workaround is to apply a sysctl change to revert to the old behavior from before that commit, or to use a subnet larger than a /24 to reduce the chance of getting the same IP.
The solution for us was to set this in sysctl.conf:
net.ipv4.neigh.default.gc_thresh1=0
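For the curious: gc_thresh1 is the number of entries below which the kernel won't bother garbage-collecting the neighbor table at all, so on a small subnet stale entries can just sit there; setting it to 0 makes the periodic GC always eligible to run. You can apply it without a reboot and then keep the sysctl.conf line above:
sysctl -w net.ipv4.neigh.default.gc_thresh1=0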
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1331150...