This one has made the rounds on reddit a few times. I shared this personal story last time it came around:
In 2005 at my job we had a pretty severe problem that was just as unexplainable. The day after an unscheduled closing (hurricane), I started getting calls from users complaining about database connection timeouts. Since I had a very simple network with fewer than 32 nodes and barely any bandwidth in use, it was quite scary that I could ping the database server for 15-20 minutes and then get "request timed out" for about 2 minutes. I had performance monitors etc. running on the server and was pinging the server from multiple sources. Pretty much every machine except the server was able to talk to the others constantly. I tried to isolate a faulty switch or a bad connection, but there was no way to explain the random yet periodic failures.
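(For anyone wanting to run the same kind of hunt: all I really had was a timestamped ping log from a couple of machines. Here's a rough sketch of that in Python if you'd rather script it than stare at terminals; the hostname and interval are made up, and the ping flags are the Linux ones.)

    import subprocess, time, datetime

    HOST = "db-server"   # hypothetical name for the database server
    INTERVAL = 1         # seconds between probes

    def up(host):
        # One ICMP echo with a 1-second timeout; returncode 0 means a reply came back.
        # (-c/-W are Linux ping flags; Windows uses -n/-w instead.)
        return subprocess.run(
            ["ping", "-c", "1", "-W", "1", host],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0

    last = None
    while True:
        state = up(HOST)
        if state != last:
            # Log only the up/down transitions so the outage windows are easy
            # to line up with whatever else is happening (pallets included).
            print(datetime.datetime.now().isoformat(), "UP" if state else "DOWN")
            last = state
        time.sleep(INTERVAL)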
I asked my coworker to observe the lights on a switch in the warehouse while I ran trace routes and unplugged different devices. After 45-50 minutes on the walkie-talkie with him saying "ya it's down, ok it's back up," I asked if he noticed any patterns. He said, "Yeah... I did. But you're going to think I'm nuts. Every time the shipper takes away a pallet from the shipping room, the server times out within 2 seconds." I said "WHAT???" He said "Yeah. And the server comes back up once he starts processing the next order."
I ran down to see the shipper and was certain that he was plugging in a giant magnetomaxonizer to celebrate the successful completion of an order. Surely the electromagnetic waves from the flux capacitor were causing a rip in the space-time continuum and temporarily shorting out the server's NIC 150 feet away in another room. Nope. All he was doing was loading the bigger boxes onto the pallet first and then gradually the smaller ones on top, while scanning every box with the wireless barcode scanner. Aha! It must be the barcode scanner's wireless features that latch on to the database server and cause all other requests to fail. Nope. A few tests later I realized it wasn't the barcode scanner, since it was behaving pretty nicely. The wireless router and its UPS in the shipping room were configured right and seemed to be functioning normally too. It had to be something else, especially since everything was working fine just before the hurricane closing.
As soon as the next timeout started, I ran into the shipping room and watched the guy load the next pallet. The moment he placed four big boxes of shampoo on the bottom row of the pallet, the database server stopped timing out! This had to be black magic! I asked him to remove the boxes and the database server began to time out again! I could not believe the absurdity of this and spent five more minutes loading and unloading the boxes of shampoo with the exact same result. I was about to fall to my knees and start begging for mercy from the God of Ethernet when I noticed that the wireless router in the shipping room was mounted about a foot lower than the tops of the four big boxes once they were stacked on the pallet. We were finally on to something!
The wireless router lost line of sight to the outside warehouse anytime a pallet was loaded with the big boxes. Ten minutes later I had the problem solved. Here is what happened. During the hurricane, there was a power failure that reset the only device in our building that wasn't connected to a UPS - a test wireless router I had in my office. The default settings on the test router somehow made it a repeater for the only other wireless router we had, the one in the shipping room. The two wireless nodes were only able to talk to each other when there were no pallets placed between them, and even then the signal wasn't too strong. Every time the two wireless routers managed to talk, they created a loop in my tiny network, and as a result all packets to the database server were lost. The database server had its own switch off the main router and hence was pretty much the furthest node. Most other PCs were on the same 16-port switch, so I had no problems pinging constantly between them.
The 1-second solution to this four-hour troubleshooting nightmare was me yanking the power cord on the test router. And the database server never timed out again.
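For what it's worth, a layer-2 loop like that usually gives itself away if you know what to look for: with two paths between the same segments, frames get flooded around the loop and a single ICMP echo can come back more than once, which Linux ping marks with "(DUP!)". A quick sketch of checking for that (hostname made up again, and duplicates can have other causes, so treat it as a hint, not proof):

    import subprocess

    HOST = "db-server"   # hypothetical hostname

    # Send a burst of echoes and look for duplicated replies. Linux ping tags
    # a repeated reply with "(DUP!)" and prints a "+N duplicates" summary;
    # seeing those is a strong hint that frames are circulating a loop.
    out = subprocess.run(
        ["ping", "-c", "20", HOST],
        capture_output=True, text=True,
    ).stdout

    dups = out.count("DUP")
    print(f"{dups} duplicated replies out of 20 probes")
    if dups:
        print("Possible switching loop (or a rogue repeater/bridge).")

A managed switch running spanning tree would have blocked the redundant path automatically; a consumer router that a power cycle flipped back into repeater defaults gets no such protection, as far as I know.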
I had a vaguely similar problem with a node at the end of a long, but in-spec, cat5e cable periodically experiencing huge packet loss, typically at night: it would last for about 5-10 minutes, then work fine for about 10-20. The problem started right after I added a switch at the end of the line, but the switch had been in service elsewhere and had never had any problems.
It turned out the Ethernet cable was placed in a conduit that ran close to a conduit carrying a high-current split-phase 240V power line. Most of the devices on that line were running and pulling huge amounts of current all the time, so I didn't originally suspect that the line was the problem. But there was a furnace on that line whose variable-speed blower was connected to a single 120V phase, and the unbalanced current draw on the power line created enough common-mode noise in the cat5e cable to throw off the switch. The furnace came on periodically at night to heat the building, causing the packet loss. The solution was to put a device at the end of the Ethernet cable with better common-mode rejection.
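If it helps anyone see why the unbalanced draw is what matters: when both legs of a split-phase feed are loaded evenly, the two currents in the conduit flow in opposite directions and their magnetic fields mostly cancel at the data cable. A load on only one 120V leg leaves a net current, and a crude way to model what that induces on the pair is a mutual inductance between the two conduit runs (M and i(t) here are purely illustrative, nothing was measured):

    % net unbalanced current i(t), mutual inductance M set by the conduit geometry
    V_{\text{cm}}(t) \approx M \, \frac{d i(t)}{d t}

A variable-speed blower drive chops its current, so di/dt is large even when the current itself is modest, which is how it can out-shout the heavy steady loads on the same line.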
A "normal" person might have simple "turned it off and on again" and thus fixed it. Interesting how you wanted to understand what's going on and went through a lot of trouble to get there. I'd have done the same.
Well the first problem is to find the thing to turn off and on again. It's not always clear what device is misbehaving.
Also, frequently you'll need to report to the "higher ups" what went wrong. Saying "I dunno, I turned it off and on again and it works now. shrug" makes you look like you don't know what's going on. It's much better to be able to explain what happened.