

When your IP traffic in AWS disappears into a black hole - schimmy_changa
http://engineering.clever.com/2014/12/10/when-your-ip-traffic-in-aws-disappears-into-a-black-hole/

======
jcollins
We saw this earlier this year after upgrading to a new Linux kernel.

The solution for us was to set this in sysctl.conf:

net.ipv4.neigh.default.gc_thresh1=0

[https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1331150/](https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1331150/)
[https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1331150/comments/12](https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1331150/comments/12)
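
For readers applying the same setting: per the Linux arp(7) man page, `gc_thresh1` is the minimum number of neighbor entries to keep, and the garbage collector does not run while the table is smaller than that, so setting it to 0 should let stale entries be collected regardless of table size. The sketch below is an illustration only (the function names are hypothetical, not from the article or the bug report); it checks whether a sysctl.conf-style text applies that setting.

```python
def parse_sysctl_conf(text):
    """Parse sysctl.conf-style text into a {key: value} dict."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments (sysctl.conf accepts '#' and ';').
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        settings[key.strip()] = value.strip()
    return settings


def gc_thresh1_is_zero(text):
    """Return True if the text sets net.ipv4.neigh.default.gc_thresh1 to 0."""
    value = parse_sysctl_conf(text).get("net.ipv4.neigh.default.gc_thresh1")
    return value == "0"
```

Feeding it the contents of `/etc/sysctl.conf` only tells you what is configured on disk; the setting takes effect after `sysctl -p` or a reboot.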

------
danesparza
Am I weird because I actually muttered 'ARP caching issue' halfway through
your article? :-)

Love the technical write-up -- thanks!

~~~
schimmy_changa
I don't think you're weird. I think you're probably a great candidate to work
as a sysadmin / SRE, or at Clever. We're hiring:
[https://clever.com/about/jobs#engineer-distributed-systems](https://clever.com/about/jobs#engineer-distributed-systems)!

~~~
justincormack
Your job descriptions list a lot of technologies but don't mention anything to
do with operating systems or networking. It seems odd to explicitly call out
Node and redis but not anything down the stack.

~~~
schimmy_changa
It's a good point, but right now we don't do anything that relies specifically
on one technology at the very low levels (for instance, kernel modules or very
specific networking optimizations are things we _don't_ need and would be
over-optimization).

Instead we try to maintain a SOA which is agnostic to what underlying tools
back our services, something that we've found Docker is useful for.

That being said, we generally just use Ubuntu, and have some other
AWS-specific setups around networking. If you like hacking on Cisco routers, I
think it's perhaps not the right position :)

------
ChuckMcM
Interesting trace down to stale ARP entries. It gets worse when the switches
are running MAC address filtering and _they_ get out of date. We had that
issue with some Blade G8052 top of rack switches with their upstream 10G
ports. They sometimes "forget" which upstream port has the MAC address that
they are switching to, and those packets just spew out into the data center,
leaving a mess. The "fix" is to force the switch to ping up through a
specific upstream port periodically to the center switch's management IP
address. Sigh.

~~~
p8952
I've seen the exact same issue in some Dell PowerConnect switches.

We only saw the issue when using the "iSCSI Auto-Config" mode. Manually
configuring the switches with the same config but entered by hand resolved the
issue.

------
spectre256
This reminds me of a time at a previous company years ago, where we
experienced an issue that felt similar, although the root cause was quite
different.

Basically, we had multiple teams all launching/terminating web servers.
Unfortunately, they were all in the same EC2 deployment, and more often than
not our load balancers from one team would send traffic to the web servers of
another team. Furthermore, our setups were similar enough that this would
sometimes cause bad results for users. We fixed it by making sure that our web
servers on every team spoke on different ports. Not elegant, but effective
(until two teams accidentally picked the same ports).

These days we have good enough infrastructure tools that this problem should
never happen. But in 2009, at a company that was overwhelmed with growth,
those sorts of things happened.

~~~
vosper
Would a better fix have been to have a load balancer per team, and then have
the EC2 instance bind to the correct load balancer?

~~~
spectre256
That's what we had, but because the IPs were often recycled, traffic sometimes
went to the wrong place if the load balancers had stale data.

------
falcolas
Try an arping from the new workers on first startup? Ran into this quite a bit
when using VIPs for DB failover, and an arping fixed the caching issue in most
cases.
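
A minimal sketch of that startup step, assuming the iputils version of `arping` (its `-U` flag sends unsolicited ARP to update neighbours' caches, `-c` sets the packet count, `-I` the interface). The function name and the example interface and address are hypothetical; the code only builds the command so it can be inspected before anything is run.

```python
import ipaddress


def build_arping_command(interface, own_ip, count=3):
    """Build an iputils `arping` command that announces own_ip on interface.

    Raises ValueError if own_ip is not a valid IPv4 address.
    """
    address = ipaddress.IPv4Address(own_ip)  # validates the address
    return [
        "arping",
        "-U",               # unsolicited ARP: update neighbours' ARP caches
        "-c", str(count),   # stop after this many packets
        "-I", interface,    # interface to send from
        str(address),
    ]
```

On a real worker the result would be passed to something like `subprocess.run(build_arping_command("eth0", "10.0.0.5"), check=True)`, which typically needs root privileges.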

------
schimmy_changa
I think the biggest thing I was surprised by with this investigation was the
lack of documentation about data-layer tools. At one point I was looking
through the source of the 'ip' command to try to find out exactly which
conditions caused a 'STALE' entry in the ARP table...
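
For anyone doing a similar investigation, here is a rough sketch that picks the STALE entries out of `ip neigh show` output. It assumes the usual iproute2 line shape (`<ip> dev <iface> lladdr <mac> <STATE>`, with `lladdr` missing for unresolved entries); the function names are hypothetical, and it does not try to explain why the kernel marked an entry STALE.

```python
def parse_ip_neigh(output):
    """Parse `ip neigh show` text into a list of dicts, one per entry."""
    entries = []
    for line in output.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue  # blank or unexpected line
        # Assumption: the neighbor state is the last token on the line.
        entry = {"ip": tokens[0], "dev": None, "lladdr": None, "state": tokens[-1]}
        for key in ("dev", "lladdr"):
            # Only look before the final token so key + 1 is always valid.
            if key in tokens[:-1]:
                entry[key] = tokens[tokens.index(key) + 1]
        entries.append(entry)
    return entries


def stale_entries(output):
    """Return only the entries whose state is STALE."""
    return [entry for entry in parse_ip_neigh(output) if entry["state"] == "STALE"]
```

The input could come from `subprocess.run(["ip", "neigh", "show"], capture_output=True, text=True).stdout`.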

------
maerF0x0
I wonder if this is a problem for any cloud provider. I also wonder if IPv6
could help mitigate? Then the IP collisions would be rarer.

~~~
schimmy_changa
I think you are definitely right that IPv6 would mitigate, but I think that
the real way to avoid this issue is for the kernel to make the tradeoff for
you - it's so cheap to do an arp ping, you might as well on maybe every 100th
packet or so. Or similar: timing out and clearing the arp entries more
frequently.
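
Linux already exposes a few knobs in this direction. As a hedged sketch, the snippet below reads three neighbor-table tunables documented in arp(7) from `/proc/sys/net/ipv4/neigh/default`: `base_reachable_time_ms` (roughly how long a confirmed entry is treated as valid), `gc_stale_time` (how often stale entries are checked) and `gc_thresh1`. The function name is hypothetical, and which values are appropriate is not something this thread settles.

```python
import os

NEIGH_DIR = "/proc/sys/net/ipv4/neigh/default"
TUNABLES = ("base_reachable_time_ms", "gc_stale_time", "gc_thresh1")


def read_neigh_tunables(base_dir=NEIGH_DIR, names=TUNABLES):
    """Return {name: int or None} for each tunable file under base_dir."""
    values = {}
    for name in names:
        path = os.path.join(base_dir, name)
        try:
            with open(path) as handle:
                values[name] = int(handle.read().strip())
        except (OSError, ValueError):
            # Missing file (non-Linux host, older kernel) or unexpected content.
            values[name] = None
    return values
```

Calling `read_neigh_tunables()` on a Linux host returns the live values; on other systems every value is simply `None`.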

As for other cloud providers, I'd be really interested to hear from Digital
Ocean, Rackspace, and Google!

~~~
jsolson
Answered parallel to you in this thread:
[https://news.ycombinator.com/item?id=8732356](https://news.ycombinator.com/item?id=8732356)

------
wahnfrieden
FYI, Clever: I click "Engineering Blog" at the top, and all links to blog
posts on that page 404.

~~~
rgarcia
Sorry about that. It's fixed now!

------
girvo
We had a fascinating bug on EC2 -- we could connect _to_ the instance, but no
network traffic made it out. It wasn't a security group problem; it was
literally a really weird bug in EC2's network that we somehow triggered. The
engineer over at Amazon who looked at it was really excited when he came
across our case because it was so weird, heh. They fixed it. I can't remember
exactly what was done on their end, but it was one of the weirder problems
I've attempted to debug. Nothing I tried worked!

------
perlgeek
Wouldn't it be a better solution to not reuse IP addresses quickly? If I
understood it correctly, they are in a private network anyway, so they could
afford it.

~~~
daveloyall
I thought there was a mechanism that would allow some sort of failure
notification packet to make it back to the origin, so that it would then
perform a proper ARP lookup _WHO HAS 1.2.3.4? TELL 4.3.2.1!_.

Perhaps AWS is blocking some class of packet?

~~~
X-Istence
The issue is that if it is on-network, the only real "failure" notification is
a time-out. No ICMP packets, nothing, is sent back to indicate you are sending
to a non-existent IP address.
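
From the client side, that means the only observable signal is the connection attempt timing out. A minimal sketch, assuming a plain TCP check is acceptable for your service (the function name and return labels are hypothetical):

```python
import socket


def probe_tcp(host, port, timeout=2.0):
    """Try a TCP connect and report "connected", "timeout", or "error".

    A black-holed peer typically shows up as "timeout", because nothing
    comes back to refuse the connection.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        return "timeout"
    except OSError:
        # Actively refused, unreachable, name resolution failure, etc.
        return "error"
```

For example, `probe_tcp("10.0.0.12", 27017)` (a made-up address, with the mongod port from the article) would distinguish a refused connection from one that silently disappears.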

------
zenocon
I just experienced this earlier this week. Very frustrating. I also posted to
the AWS forums and got zero assistance; I'm currently not paying for an AWS
support plan. This article came at an opportune moment -- it makes sense and
removes the shroud of mystery around why it "works sometimes", which leaves me
with an uneasy feeling for a production setup.

~~~
schimmy_changa
Awesome, I was hoping it could help out someone else :)

------
kiyoto
Looking at the port number, it looks like Clever is a MongoDB user =)

~~~
akx
You, uh, don't even have to just recognize the port number, reading the
article is enough...

> [...] These workers all connect to a MongoDB replicaset to update and read
> data [...]
>
> 27017 = port mongod is listening on

------
legohead
I was going to respond with my little story, but I see your article already
linked it! ;)

