
The little ssh that (sometimes) couldn't - LiveTheDream
http://mina.naguib.ca/blog/2012/10/22/the-little-ssh-that-sometimes-couldnt.html
======
mikeash
Well, while we're at it, here's my crazy MTU-related war story, although not
as crazy as that one!

I was troubleshooting with a user of an audio streaming application running
over a LAN. The user could stream classical music but not rock music.
Seriously. Classical was fine, but when streaming rock, the connection would
drop after a few minutes.

The application took chunks of audio, compressed them with a lossless codec,
and then sent each chunk in a separate UDP packet to the other end. It tried
to use IPv6 whenever possible because it was generally more reliable in the
LAN environment, although it would happily use IPv4 if need be.

After a _huge_ amount of boring troubleshooting going back and forth with this
guy, I finally figured it out. Somehow, he had set his network interface's MTU
to 1200 bytes. IPv6 won't perform automatic IP-level fragmentation for MTUs
below 1280 bytes, so larger packets simply could not be sent at all. The
streaming application would try to send an audio packet larger than 1200
bytes, get an error, and bail out of the connection.

Why did it only happen with rock music? Turns out to be pretty simple.
Lossless codecs are necessarily variable bitrate, and classical music
compresses better than rock music. When streaming classical, each chunk of
audio consistently compressed to less than 1200 bytes, but rock music produced
occasional packets over the threshold.
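
A rough sketch of both halves of that explanation, under stated assumptions: a plain 40-byte IPv6 header with no extension headers plus an 8-byte UDP header, and `zlib` standing in for the lossless audio codec (the real application used something else). The function names and the two synthetic chunks are made up for illustration.

```python
import math
import random
import zlib

IPV6_HEADER = 40  # fixed IPv6 header, assuming no extension headers
UDP_HEADER = 8


def max_udp_payload(mtu: int) -> int:
    """Largest UDP payload that fits in one unfragmented IPv6 packet."""
    return mtu - IPV6_HEADER - UDP_HEADER


def quiet_chunk(n_samples: int) -> bytes:
    """Hypothetical 'classical' chunk: a soft, perfectly periodic 16-bit tone."""
    samples = (int(1000 * math.sin(2 * math.pi * i / 64)) for i in range(n_samples))
    return b"".join(s.to_bytes(2, "little", signed=True) for s in samples)


def dense_chunk(n_samples: int, seed: int = 0) -> bytes:
    """Hypothetical 'rock' chunk: full-scale noise, the worst case for a compressor."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(2 * n_samples))


budget = max_udp_payload(1200)
for name, chunk in (("quiet", quiet_chunk(1024)), ("dense", dense_chunk(1024))):
    packed = zlib.compress(chunk)  # stand-in for the lossless audio codec
    verdict = "fits" if len(packed) <= budget else "too big"
    print(f"{name}: {len(chunk)} -> {len(packed)} bytes, {verdict} (budget {budget})")
```

Both chunks start out the same size; only the compressed size differs, which is the whole effect: the predictable signal stays under the per-packet budget and the noisy one does not.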

The user didn't know why his MTU was turned down and didn't need it, so we
turned it back up and everything worked just fine.

~~~
dhoe
I love these sorts of insane-sounding problem descriptions. Here's one that
happened to me: WiFi disconnects when I visit Gmail, and doesn't reconnect
until I reboot Linux. Cause: Gmail's chat thingy uses Flash, which for some
reason does some webcam initialization, which triggers a bug (in combination
with crappy hardware) in the uvcvideo kernel module which leads to a timeout
which leads to the whole USB bus going down. Which includes the WiFi chip.

~~~
mikeash
How long did that take you to figure out? I imagine that would drive me
completely mad.

------
js2
This is insane. The closest scenarios to this I've seen in my career:

1) A private frame relay network that one day stopped passing packets over a
certain size. Worked around by lowering the MTU at both ends till I was able
to convince the frame relay provider that yes, the problem was in their
network. This was relatively straightforward to diagnose, but it was still
odd being able to ssh into a box, then have the connection hang once I did
something that sent a full-size packet (cat a large file, ls -l in a big
directory, etc).

2) A paging gateway program I wrote (email to SMS) that worked fine when
testing on my Mac, but couldn't establish connections to a particular Verizon
web site when I ran it from a Linux box. Turned out that the Linux TCP stack
had ECN enabled and at the time the Verizon website was behind a buggy
firewall that blocked any packets with ECN bits set.

3) A Solaris box that could randomly be connected to, but not always. Turned
out someone had deleted its own MAC address from its ARP table (yes, you can
do this with Solaris) so it wasn't replying to ARP packets for itself. As I
recall, it could make outbound connections, and then you could connect to it
from that same peer until the peer timed out the ARP entry. Then the peer
couldn't reach the Solaris box again.

None of these are nearly as complex as the scenario in this story.

~~~
derleth
> someone had deleted its own MAC address from its ARP table

 _blink_

Two questions:

\- Is there _ever_ a valid reason to do this?

\- How do you attain the skills required to do this while not also learning
_not_ to?

~~~
cnvogel
So, I was working at a very small internet service provider in a rural area in
the mid-nineties. For the lack of affordable hardware, we were using Linux
machines for routing, and a lot of "unconventional" solutions were necessary
due to insufficient hardware being used. Tunnelling and other virtual
interfaces of any kind were used often.

I remember one particular case where we were running both routed IP and bridged
ethernet over a single frame-relay link, and there we had to resort to fixed
ethernet-to-ip mapping (turning off ARP) on the bridged link for some reason I
really can no longer remember.

------
gwright
Reminds me of a problem I had with a T1 circuit corrupting packets.

Shortly after bringing up a second T1 into a remote location we discovered
that some web pages would show broken JPG images at the remote site.

Some troubleshooting revealed that this only happened when traffic was routed
over the new T1. The old T1 worked just fine. Pings, and other IP traffic
seemed to work over either line but we kept seeing the broken image icon for
some reason when traffic came over the new T1.

We tried several times to confirm with the telco that the T1 was provisioned
correctly and that our equipment matched those telco parameters. Still had
some mangled bits going over that new T1.

Finally had the telco check the parameters over every span in the new (long-
distance) T1 circuit and they eventually found one segment that was configured
for AMI instead of B8ZS (if I can remember correctly, certainly it was a
misconfigured segment though).

The net result is that certain user-data patterns that didn't include
sufficient 0/1 transitions would lead to loss of clock synchronization over
that segment and corrupted packets. Those patterns were most likely to occur
in JPGs.
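
A minimal sketch of how one might screen a payload for that failure mode. It assumes the commonly quoted ones-density rule for AMI spans (no more than 15 consecutive zero bits) and ignores T1 framing bits and channelization entirely; `MAX_ZEROS`, `longest_zero_run`, and `survives_ami` are names invented for this example.

```python
MAX_ZEROS = 15  # assumed ones-density limit for an AMI-coded span


def longest_zero_run(data: bytes) -> int:
    """Length of the longest run of consecutive 0 bits, reading each byte MSB first."""
    longest = current = 0
    for byte in data:
        for shift in range(7, -1, -1):
            if (byte >> shift) & 1:
                current = 0  # a 1 bit gives the receiver a pulse to lock onto
            else:
                current += 1
                longest = max(longest, current)
    return longest


def survives_ami(data: bytes) -> bool:
    """True if the data never starves the receiver of pulses for too long."""
    return longest_zero_run(data) <= MAX_ZEROS


print(survives_ami(b"\x80\x01"))      # 14 zeros between the two 1 bits
print(survives_ami(b"\x00\x00\x00"))  # 24 zeros in a row
```

B8ZS avoids the problem by substituting a recognizable pattern for every run of eight zeros, so a span set to plain AMI in the middle of a B8ZS circuit only misbehaves on zero-heavy data.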

Once they corrected the parameters on that segment, everything worked as
expected.

Quite a bit of head scratching with that one and lots of frustration as the
layer-1 telco culture just couldn't comprehend that layer-2/3 Internet folks
could accurately diagnose problems with their layer-1 network.

~~~
NateLawson
One of my fellow ISP admins (silicon.net, I was elite.net) had the same
problem around 1995. GIFs would load on web pages but not JPEGs. They'd load
for some amount of time and then hang.

The way he diagnosed it was to do a transfer of /dev/zero out one link. It
worked. But it stopped almost immediately out the problem link. It turned out
to be the same problem -- no zero bit stuffing configured on the line.

By the way, this same technique, known as "weak bits", was used as a floppy
and CD/DVD copy protection scheme.

------
ChuckMcM
That is an awesome story. If you're in devops I would suggest you look at the
sequence of events, especially the debugging decision tree. You can't always
get access to all of the machines but you can create 'views' by going through
them. Sort of like astronomers using a gravitational lens.

We had a similar issue at Blekko where a 10G switch we were using would not
pass a certain bit pattern in a UDP packet fragment. Just vanished. Annoying
as heck, the fix was to add random data to the packet on retries so that at
least one datagram made it through intact.
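
One way such a workaround could look, purely as a guess at the shape of it and not Blekko's actual code: prepend a random-length block of random bytes so that each retry puts different bits, at different offsets, on the wire. The one-byte length prefix and the `wrap`/`unwrap` names are assumptions made for this sketch.

```python
import os


def wrap(payload: bytes) -> bytes:
    """Prefix the payload with 1..16 random bytes and a one-byte salt length."""
    salt_len = 1 + os.urandom(1)[0] % 16
    salt = os.urandom(salt_len)
    return bytes([salt_len]) + salt + payload


def unwrap(datagram: bytes) -> bytes:
    """Strip the salt added by wrap() and return the original payload."""
    salt_len = datagram[0]
    return datagram[1 + salt_len:]


payload = b"x" * 3000  # large enough to be fragmented on a typical 1500-byte MTU
first_try, retry = wrap(payload), wrap(payload)
print(unwrap(first_try) == unwrap(retry) == payload)
```

The receiver sees the same payload every time, but a fragment that vanished on the first attempt will almost certainly carry a different bit pattern on the next one.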

~~~
bowmessage
Sounds more like a workaround, less of a fix :P. Weird stuff!

------
alexkus
On a related note I used to be the person who got to go see customers who had
problems with our software that the support desk couldn't solve. This often
meant one- or two-day trips to glamorous industrial estates on the edge of
various cities all around the world.

About 3 visits in a row I went to look at problems (core dumps or errors) that
the customer could reproduce at will, only for them to be unable to replicate
the problem with me present on site.

I sat at one customer (in sunny Minneapolis) for 2 hours in the morning with
the customer getting increasingly baffled as to why he couldn't get it to
fail; it had been happily failing for him the previous evening when I was
talking to him on the 'phone. We gave up and went for lunch (mmm, Khan's
Mongolian Barbeque). A colleague of his called him midway through lunch to
tell him that the software was failing again. Excellent I thought, we'll
finally get to the bottom of it. Back to their office and ... no replication;
it was working fine.

As a joke I said I should leave a clump of my hair taped to the side of the
E450 it was running on. The customer took me up on that offer and, as far as I
know (definitely for a few years at least), the software ran flawlessly at
that customer.

It's the closest I've got to a "'more magic' switch" story of my own.

------
SoftwareMaven
This is why good DevOps people are worth their weight in platinum. As a
developer who has done just enough administration to be dangerous, I can
easily say that my job is always far more enjoyable when there are good DevOps
folks around to keep my systems happy and shield me from the crazy place that
is the Internet's wiring.

------
greenyoda
The striking thing about this story is that even after the problem was solved
by re-routing traffic around the bad hardware, the author continued to
investigate until the ultimate cause was tracked down. This almost obsessive
desire to understand the true causes of problems (whether they be related to
operations, software development, or whatever) is one of the things that makes
people really good at what they do.

------
unimpressive
Reminded me of this:

<http://www.ibiblio.org/harris/500milemail.html>

~~~
caf
The numbers in this story do not add up. For a connection to be made within
3ms, there needs to be a round-trip within that time, which reduces the
maximum possible radius by half.
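
The arithmetic behind that objection, as a small sketch. The 3 ms figure comes from the linked story; `velocity_factor` is an extra knob added here for signal speeds below c, and the function name is made up.

```python
C_KM_PER_S = 299_792.458  # speed of light in vacuum
KM_PER_MILE = 1.609344


def max_radius_miles(timeout_s: float, round_trip: bool = True,
                     velocity_factor: float = 1.0) -> float:
    """Farthest peer reachable before the timeout expires."""
    distance_km = C_KM_PER_S * velocity_factor * timeout_s
    if round_trip:
        distance_km /= 2  # the signal has to get there and back
    return distance_km / KM_PER_MILE


print(round(max_radius_miles(0.003, round_trip=False)))
print(round(max_radius_miles(0.003)))
```

One-way at c, 3 ms covers roughly 559 miles; requiring a round trip cuts that to about 279 miles, before accounting for any slowdown in the medium or delay in the equipment.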

~~~
sirclueless
Also, don't both photons in fiber and electrons in copper wire travel around
60% of the speed of light?

~~~
alexkus
In most fiber optic cable photons travel at about 65% of c. The exact number
depends on the refractive index of the fiber.

For electricity: Paraphrasing
<https://en.wikipedia.org/wiki/Speed_of_electricity>

The 'Current' travels along a wire at anything between 97% and 60% of c
(depending on insulation; more insulation == slower).

Individual electrons in copper wire travel much more slowly. The 'drift
velocity' is roughly proportional to the voltage; for low voltage DC it is on
the order of millimeters per hour.

For AC voltage individual electrons don't have any net movement since they're
oscillating back and forth with the alternating current.

------
swordswinger12
I love reading weird bug stories like this. Is there a place where lots of
these types of stories are aggregated? Maybe a book about them?

~~~
rachelbythebay
Does it count if they're weird misconfigurations which may or may not have
been set by a human?

<http://rachelbythebay.com/w/2011/08/16/window/>
<http://rachelbythebay.com/w/2011/07/02/ninja/>
<http://rachelbythebay.com/w/2012/01/08/blackhole/>
<http://rachelbythebay.com/w/2012/01/18/alertstorm/>

There are many more. Unless specifically noted, they all happened to me at
some point.

~~~
swordswinger12
Thanks!

------
soldermont001
A dev submitted code and broke our build once. When we looked at what he
submitted, there appeared to be random syntax errors in it. On his
workstation, however, the code was correct.

We tracked it down to a switch that was corrupting packets enough that the TCP
checksum wasn't sufficient protection, and the packets would simply pass their
checksum despite having been altered.

The outcome was that we always use compression, or encryption, as an added
layer of protection.
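
To illustrate how weak that protection is, here is a sketch of the 16-bit ones' complement checksum in the style of RFC 1071 (real TCP also sums a pseudo-header, which is left out here). The sample data and the particular corruption, two aligned 16-bit words swapping places, are invented for the demonstration; the switch in this story presumably mangled bits in some other way that happened to cancel out.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"if (x) { y(); }#"  # 16 bytes, i.e. eight 16-bit words
# Hypothetical corruption: the first two 16-bit words trade places.
corrupted = original[2:4] + original[0:2] + original[4:]

print("data differs:      ", original != corrupted)
print("checksum identical:", internet_checksum(original) == internet_checksum(corrupted))
print("CRC-32 identical:  ", zlib.crc32(original) == zlib.crc32(corrupted))
```

Because the checksum is just a sum, reordering words or making two changes that cancel leaves it untouched, while a CRC (or the integrity check built into a compressed or encrypted stream) catches this particular corruption.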

------
jwr
Hey, I had this problem! Exactly the same symptoms, although I never got as
far as dumping actual packet contents. But I did verify that packet loss (of
various sizes) was not the culprit. It was SSH (and some monitoring TCP
connections) that failed (hung), always precisely at the same moment.

I suspected the VM code at the time, but it is very likely that my packets had
to go through the same router (geography would support this).

I'm so glad somebody debugged this problem. Also, I'm quite glad that at least
this time I'm not the only person with a weird issue (I have a knack for
breaking things).

------
ComputerGuru
Off-topic: can someone provide a good reason why SSH w/ the HPN patches is not
the default for every SSH install on every platform?

Today, people are relying on SSH for binary transfer more than ever. SFTP and
SCP are the new de facto file transfer standards from machine to machine
over a secured connection. Source control like GIT (or even SVN) make heavy
use of binary transfers over SSH. The performance benefit to the entire world
is immeasurable. Yet unless you explicitly go out of your way to manually
compile and install SSH-HPN, you don't get it.

That said, given how slow SSH is on Windows (GIT pushes and pulls are
exponentially slower than on *nix or OS X), does anyone have a good link to a
Putty HPN build?

~~~
18pfsmt
I have to admit I felt pretty ignorant for not knowing what you were talking
about. So, for anyone else in a similar situation:
<http://www.psc.edu/index.php/hpn-ssh>

~~~
ComputerGuru
Don't. It's a very esoteric topic. Hence my frustration - I wish it weren't
so!

------
lysium
Can anybody think of an explanation why the 'bug' happened only after the
576th byte?

~~~
delinka
I loves me some speculation! Here goes:

576 decimal looks like this in other common bases:

    
    
      binary: 0000 0010 0100 0000
      octal: 1100
      hexadecimal: 240
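
For anyone who wants to double-check those conversions, a quick sketch (the `grouped_binary` helper is just a convenience written for this example):

```python
def grouped_binary(value: int, width: int = 16) -> str:
    """Zero-padded binary representation, split into groups of four bits."""
    bits = format(value, f"0{width}b")
    return " ".join(bits[i:i + 4] for i in range(0, width, 4))


n = 576
print("binary:     ", grouped_binary(n))
print("octal:      ", format(n, "o"))
print("hexadecimal:", format(n, "x"))
```

Only two bits are set (512 + 64), which is why the octal and hexadecimal forms look so tidy.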
    

My first inclination would be a firmware problem: was it upgraded recently?
Are there any known problems with the version that was installed? Did you
build it yourself? If we have access to the code, the information about the
'shape' of 576 may come in handy. Or maybe we just need to look at your build
environment.

Assuming the firmware had not changed recently (very likely in gear that sits
quietly doing its job without human intervention for long periods of time),
then failing hardware becomes the suspect. Maybe a memory module is going bad
and this particular byte is normally avoided (see intermittent failures in
OP.) Maybe it's using flash to store transitory data and a particular cell is
going bad. Maybe the unit has suffered vibration damage and a solder point
related to memory has come loose.

Some of these are far less likely than others. I think of all these, I'd put
money on a bad memory stick.

~~~
kijiki
576 bytes is the minimum datagram size that must be accepted by all nodes in
an IPv4 internet.

In practice, on the modern Internet, you'll see tons of packets larger. So who
knows; perhaps some massively outdated optimization in an ASIC somewhere
resulted in different hardware paths for >576 bytes.

------
rdl
I'd be pissed at a transit provider who mangled packets like this.

The more ambiguous situation is that early Juniper routers would fairly
frequently re-order packets. That's nominally allowed, but a lot of protocols
didn't like it.

There are way weirder things on satellite or other networks (spoofing acks,
etc.).

------
acdha
Great story - I've had MTU and firewall fun before but nothing so subtly
treacherous.

I've been wondering about something not entirely unrelated we see sporadically
from a small but widespread number of users. We serve deep zoom images and the
client appears to run normally but sends malformed image tile requests - e.g.
in the URLs "service" is consistently garbled as "s/rvice", "dzi" as "d/i".
I've seen this from IPs on every continent and user agents for most common
browsers as well as both iOS and Android. My current theory is that it's some
sort of tampering net filter as a fair number of the IPs have reverse DNS /
Whois info suggesting educational institutions but have thus far failed to
confirm this, particularly since none of the users have contacted us.

------
geofft
Awesome story.

~~~
andrewcooke
it's nice to read something like this that's been written recently. it often
seems like such war stories come from some golden past...

------
kabdib
Nice analysis.

I had a similar problem, less hairy, involving a bad bit in a disk drive's
cache RAM. Took a day or so to figure out a solid repro.

Stuff like this does happen. Handling bit errors in consumer electronics
storage systems is an interesting problem, and one that I'd love to see more
attention paid to.

------
zanny
I'm curious how the kernel was able to diagnose that a single one bit always
being fixed on the 15th of 16th bytes in the packet was corruption. That
sounds like some intense algorithmic profiling, especially if it's being applied
to every packet.

~~~
jtgeibel
The kernel is rejecting packets that fail the checksum verification. It just
happens that the source of the bit errors was behaving in this predictable
way.
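
A simplified sketch of that receiver-side check, assuming a bare payload with a trailing 16-bit ones' complement checksum (real TCP also covers its header and a pseudo-header). The `seal`, `verify`, and `flip_bit` helpers are invented for illustration; the point is that no pattern analysis is needed, since any single flipped bit makes the sum come out wrong.

```python
def ones_complement_sum(data: bytes) -> int:
    """Sum the data as big-endian 16-bit words with end-around carry."""
    if len(data) % 2:
        data += b"\x00"
    total = sum((data[i] << 8) | data[i + 1] for i in range(0, len(data), 2))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def seal(payload: bytes) -> bytes:
    """Sender side: pad to an even length and append the checksum."""
    if len(payload) % 2:
        payload += b"\x00"
    checksum = ~ones_complement_sum(payload) & 0xFFFF
    return payload + checksum.to_bytes(2, "big")


def verify(segment: bytes) -> bool:
    """Receiver side: data plus checksum must sum to all ones."""
    return ones_complement_sum(segment) == 0xFFFF


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 0x80 >> (bit_index % 8)
    return bytes(corrupted)


segment = seal(b"some payload!")
print(verify(segment))
print(any(verify(flip_bit(segment, i)) for i in range(len(segment) * 8)))
```

The intact segment verifies, and none of the single-bit corruptions do, so the kernel can simply drop the packet and let TCP retransmit.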

------
devillius
I felt a weird thrill as I read through your troubleshooting saga. Great job
on finding the faulty node and nice work on documenting it.

------
windexh8er
So this is one of those times that I find myself torn between labels.

A little background... I was brought up in the network ranks, I worked as a
network / sys admin in high school, ended up working for an ISP as a junior
network engineer in college (while I went to college at one of the first Cisco
NetAcad baccalaureate programs - which was a combo of network study and Cisco
curriculum and certifications) and have gone on to work in every major
vertical since then for the past 10+ years; government, finance, healthcare,
retail, telecomm, etc. I always tell clients and potential employers that
having a network background generally gives me somewhat of an edge in the
industry I primarily focus on: security, and I generally will study and take
Juniper & Cisco tests and work on labs just to stay current. Most software
devs and security folks I've run into (keep in mind there are a lot of really
_good_ folks who have a better grasp on network than a lot of seasoned
engineers do) are generally overzealous in the thought that they truly do
understand IP from a debugging and troubleshooting standpoint.

Case in point: I interviewed for a "Network Architect" position with a very
well known online backup company (think top 4). The interview was the most
bizarre I've ever had, not that it spanned more than 5 interviews, but that
every time they positioned a complex network problem it was generally solvable
within 5 to 10 minutes of pointed questions. The software dev who was
interviewing me was baffled by how I came to a reasonable solution that took
them over a week, in some cases, that quickly - and it was pretty simple in
the fact that 1) I've seen something similar and 2) that's what I studied and
still have a passion for over the course of 20+ years (when I found the
Internet in 1991).

Most of the time when I run across a "magical" problem it's because someone
hasn't looked at it from L1 up. As this article showcases you generally have
two generic stack angles to approach it from - application back down to
physical, or the inverse. Having been in network support - by the time you get
a problem like this it's often so distorted with crazy outliers that really
have nothing to do with the problem your best bet is to start from that L1 and
go back up through the stack. Reading into the problem the author describes I
think there was some key data that was missed and/or misinterpreted. There
most surely would have been key indicators in TCP checksum errors and it was
glossed over pretty lightly in the explanation - but it's interesting that
those items of interest are often cast aside when digging into something like
this. Nobody in this thread has indicated where a bit error test or even
something as simple as iperf, or similar, would have been able to more
accurately showcase/reproduce the problematic network condition.

But back to the labels remark - I don't believe, as some people have said,
that this is a DevOps role largely. I don't mean to cut down on DevOps folks
because I think, at some level, if you're a jack-of-all in any org then that's
your role, it is what it is. However, this would be a problem most suited
towards a professional network engineer - and you don't see much of that need
in the startup space until people get into dealing with actual colo / DC type
environments, otherwise it's often very simple and not architected with
significant depth or specific use cases.

Long story short: network professionals are worth the money in the case of
design, build, fix of potential issues that may seem complex to others, but
can be solved or found in minutes when you know what you're looking at. That
being said, I'm impressed that the OP dug into it to get to a point where he
could ask a specific person (who was probably a network engineer / tech of
some level) to validate/fix his claim.

------
dllthomas
Wonderful! Thank you, author and submitter both!

------
narpaldhillon
This is brilliant work. Thanks for sharing

------
seiji
Weird connection problems like that sound like tcp timestamps breaking things.
You can try turning it off across the board and see if your problems
immediately clear up:
<http://prowiki.isc.upenn.edu/wiki/TCP_tuning_for_broken_firewalls#Disabling_RFC1323_options>
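
Before changing anything, it may help to see what the machine is currently doing. A small sketch, assuming a Linux host where the setting is exposed as `net.ipv4.tcp_timestamps` under `/proc/sys` (0 meaning timestamps are off); `read_sysctl` is a helper written for this example, not a standard function.

```python
from pathlib import Path


def read_sysctl(name: str, root: str = "/proc/sys") -> str:
    """Read a dotted sysctl name such as 'net.ipv4.tcp_timestamps'."""
    return Path(root, *name.split(".")).read_text().strip()


try:
    value = read_sysctl("net.ipv4.tcp_timestamps")
except OSError:
    print("tcp_timestamps: not available on this system")
else:
    state = "disabled" if value == "0" else "enabled"
    print(f"tcp_timestamps = {value} ({state})")
```

Changing the value needs root and affects new connections system-wide, so it is better treated as a diagnostic experiment than as a fix.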

~~~
geofft
This was demonstrably link-level corruption, though, so that's irrelevant,
right? That's about higher-level failures.

