

Ask HN: How do you troubleshoot network problems? - ilurk

Is there a good source of information on how to tackle network problems, or is this a hopeless ad-hoc skill?

After reading this recent thread [0] I have to wonder how the heck I would troubleshoot this!?

I know the basics:
- check if the NIC module is loaded
- check if you have an IP
- check that you can ping the gateway
- check that you can ping your target
- check that you can telnet to the port
- check iptables (or temporarily flush all and set default to accept)
- try to see something that stands out in wireshark

But how do you go from there?

For example, one Friday ago, during the wee hours, I lost connection to the target host. I was SSH multi-hopping and for some reason my immediate thought was "the target host 'died'". "But wait a sec, let's go one step at a time".

It was in fact the connection to my first hop on our local network. So I went to the physical machine. I had an IP. And every time I restarted the network I always got an IP. But I was unable to ping the gateway... 90% of the time. Then it came to me that some months ago other people on the same subnetwork had complained about temporary network loss. I restarted the switch but it didn't help.
My current theory is that of a faulty network switch, but this is yet to be confirmed.

The problem I had/have? looks to be way simpler than the one mentioned in [0], but networks feel a bit like some dark magic.

So any advice on improving your network detective skills?

[0] https://news.ycombinator.com/item?id=9555057
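
For reference, the "telnet to the port" step from the checklist above can be scripted. This is only a minimal sketch: the function name port_is_open and the default timeout are made up for illustration, and it only tells you whether a TCP handshake completes, not why it fails.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts alike.
        return False
```

Calling something like port_is_open("192.0.2.10", 22) from each hop in turn (the address is a placeholder) shows at which hop the connection stops working.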
======
mobiplayer
Always, always, always:

Simultaneous captures on both client and server. I've said it before and I'll
say it again: The truth is on the wire. Once you've got that, you can walk
your way down with captures on intermediate devices until you catch where the
packets are dropped.

Once you're there it really depends on the product/OS/service/filter that's
dropping the packet.
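
To illustrate the comparison of two captures, here is a rough sketch. It assumes each capture has already been exported to a list of hashable packet keys, for example (source, destination, IP ID) tuples. That key format and the function name missing_downstream are assumptions for this example, and the keys only match up if nothing in between (NAT, for instance) rewrites those fields.

```python
from typing import Hashable, Iterable, List


def missing_downstream(upstream: Iterable[Hashable],
                       downstream: Iterable[Hashable]) -> List[Hashable]:
    """Return keys seen at the upstream capture point but not downstream.

    The result keeps the order in which the packets were seen upstream.
    """
    seen = set(downstream)
    return [key for key in upstream if key not in seen]


# Hypothetical keys: (source IP, destination IP, IP identification field).
client_capture = [
    ("10.0.0.5", "10.0.0.9", 1),
    ("10.0.0.5", "10.0.0.9", 2),
    ("10.0.0.5", "10.0.0.9", 3),
]
server_capture = [
    ("10.0.0.5", "10.0.0.9", 1),
    ("10.0.0.5", "10.0.0.9", 3),
]
dropped = missing_downstream(client_capture, server_capture)
```

Running the same comparison between each pair of adjacent capture points narrows down the device where the packets disappear.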

If you're really experienced you can start building the house from the roof,
i.e. changing this or that because in your head the symptoms match other
cases you've seen before. If you're not, just take baby steps and you'll get
there sooner than you think.

Of course you can check basic connectivity first, but when I hear "network
problem" I understand that has been checked already.

Edit: by the way, your problem might look like dark magic if you don't
actually want to really know what's going on. Do you know how a packet is
delivered to a host in your subnet? I guess not, but just so you know, your
machine will first try to learn the destination's MAC address via ARP (again,
if it's inside the same subnet). Can you confirm you resolve your gateway's MAC
address when you try to ping? Can you confirm the resolved MAC address is the
correct one? There might be someone spoofing it to intercept all the traffic
exiting the subnet or maybe you've got a pair of routers/firewalls and their
HA setup is not working as expected. In any case, go a layer down and check
that.
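
A small sketch of that layer-2 check on Linux, assuming the usual /proc/net/arp text layout (IP address, HW type, Flags, HW address, Mask, Device). The function names and the status strings are invented for this example, and the expected MAC is something you have to obtain yourself, e.g. from the router's label or its admin.

```python
from typing import Dict, Optional

# Flag bit the Linux kernel sets on a completed (resolved) ARP entry.
ATF_COM = 0x2


def parse_arp_table(text: str) -> Dict[str, str]:
    """Map IP address -> MAC address for resolved entries in /proc/net/arp text."""
    table: Dict[str, str] = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue
        ip, flags, mac = fields[0], fields[2], fields[3]
        if int(flags, 16) & ATF_COM:
            table[ip] = mac.lower()
    return table


def gateway_mac_status(text: str, gateway_ip: str,
                       expected_mac: Optional[str] = None) -> str:
    """Return 'unresolved', 'mismatch', or 'ok' for the gateway's ARP entry."""
    mac = parse_arp_table(text).get(gateway_ip)
    if mac is None:
        return "unresolved"
    if expected_mac is not None and mac != expected_mac.lower():
        return "mismatch"
    return "ok"
```

On a Linux box you would feed it the contents of /proc/net/arp right after a ping attempt. "unresolved" suggests ARP itself is failing, while "mismatch" hints at spoofing or a misbehaving HA pair.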

------
floppydisk
Start simply and work your way up. Networks can be complex beasts with a lot
of moving parts, especially if you're moving across subnets or if you have
high traffic volume.

For your immediate problem, traceroute would be a good place to start to help
figure out where you're dying on the local network. If the switch is eating
packets and connections, the traceroute should die at the first hop beyond the
switch, since a plain layer-2 switch does not show up as a hop itself.

As for the faulty-switch theory, before you blame the hardware, check the
bandwidth load you're pushing across the switch. If you're saturating the link
and trying to push more stuff through the switch than it can support, it will
cause what appears to be connection loss. Also check for a switching loop
somewhere, i.e. someone plugging a cable back into the same switch and
creating a broadcast storm that doesn't die.
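
One rough way to put a number on "bandwidth load" is to sample an interface byte counter twice and compare against the link speed. This is a sketch with a made-up function name. It assumes the counter did not wrap or reset between the two samples and that you know the nominal link speed in bits per second.

```python
def link_utilization(bytes_start: int, bytes_end: int,
                     interval_s: float, link_bps: float) -> float:
    """Fraction of link capacity used between two byte-counter samples."""
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    # Counters are in bytes, link speed is in bits per second.
    bits_transferred = (bytes_end - bytes_start) * 8
    return bits_transferred / (interval_s * link_bps)
```

On Linux the counters can be read from files such as /sys/class/net/eth0/statistics/rx_bytes (eth0 is a placeholder interface name). A value that stays close to 1.0 means the link is saturated rather than broken.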

Learning networks takes time and experience, and each network is a different
beast with different usage patterns, hardware, and characteristics. A couple
useful rules of thumb that I've found helpful when dealing with network
debugging are as follows:

1) Start simply. Ping/traceroute/tcpdump/nc (netcat) are your first best
friends on Linux and should be the first place you start when you're debugging
by hand. nc is a pretty sweet program because it allows you to set up
arbitrary TCP connections between two machines without having to stand up a
full software stack. Run nc -l <portnum> to set one up. This can be incredibly
helpful if you have software that's supposed to connect over the network and
isn't working: set up an nc instance on the target port to see if there's a
connection attempt. If so, it means the bug is further up the stack.

2) Networks can be complicated beasts with a lot of interchanging parts that
interact in complex and sometimes unpredictable ways. Start simply and work
your way up in complexity when trying to ascertain cause.

3) Make sure your system logging and system monitoring are paying attention to
your network activity and issue warnings when things happen like connectivity
outages. Decent monitoring software can save a lot of debugging time because
it'll tell you what happened. Our IT guys used Icinga to monitor everything.
Bit of a PITA to set up, but worth it to be able to see what/where things were
going wrong.

4) Bandwidth isn't infinite. Check your hardware to see if it's saturated. We
had several issues where our internal network traffic was saturating the
original network design, causing everyone on the network to experience
absolutely horrid performance. After we rewired the network and isolated the
chatty boxes on their own switches, network performance improved drastically.
Don't assume the network in general hasn't outgrown the network design or that
burst traffic isn't overloading the setup. It's possible and would also
explain the intermittent outages.
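
For the nc -l trick from rule 1, a box without netcat can do roughly the same thing with a few lines of Python. This is a minimal sketch, not a netcat replacement: the function names are invented here, it only handles IPv4 TCP, and it just records who connected without exchanging any data.

```python
import socket
from typing import List, Tuple


def open_listener(port: int, host: str = "0.0.0.0") -> socket.socket:
    """Bind and listen on host:port, similar in spirit to `nc -l <portnum>`."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts on the same port while debugging.
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen()
    return srv


def log_connection_attempts(listener: socket.socket,
                            count: int = 1) -> List[Tuple[str, int]]:
    """Accept `count` connections and return the (ip, port) of each peer."""
    peers: List[Tuple[str, int]] = []
    for _ in range(count):
        conn, addr = listener.accept()  # blocks until a client connects
        with conn:
            peers.append((addr[0], addr[1]))
    return peers
```

If the software under test shows up in the returned list, the network path is fine and the bug is further up the stack. If accept() never returns, the connection attempt is not reaching the machine.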

