The case of the 500-mile email (2002) (ibiblio.org)
368 points by thunderbong on Nov 13, 2021 | 93 comments



On a similar theme, I remember reading a story about a server that would crash mysteriously every couple of weeks. They eventually worked out that this happened whenever there was a new moon or a full moon, and the resulting high tide caused a battleship moored in a nearby harbour to rise just high enough that its powerful radar would interfere with the server.


On a much smaller scale, I once worked for a wireless ISP. We had a customer who called in late September saying her service had been out for a few weeks. I went to her house and discovered that she was in a wheelchair and couldn't reach the controls for her air conditioner, so she was turning it on and off using the on/off switch on a power strip that was sitting on a desk. Her router was plugged into the same power strip. So as soon as the weather got cool enough to not need the AC, she lost her internet.


I once experienced a moored ship whose satellite Internet was extremely unreliable. It worked for two seconds and then stopped working for two seconds, over and over. The only time it worked reliably was when there was no wind at all. After checking satellite images against the antenna's pointing angle and the ship's position, we eventually figured out that the antenna was pointing directly at a windmill. So when the rotor was turning, it was intermittently blocking the signal between the satellite and the antenna. Luckily they were able to move the ship 50 meters forward and magically the Internet started working again.


I recall in the early days of WiFi, the advice around ops circles was that if you were trying to bridge two buildings using wireless, you had to set it up in the summer, not the winter. Because the water in the leaves of deciduous trees is enough to attenuate the signal. So now you've gone from "it's working" to "we have to start over," or worse, "yeah we can't actually do this."


I leech the Wi-Fi connection of a nearby convenience store from my home and have noticed that it’s much harder to get a strong signal in the summer than in colder months.


Heh, I remember a story of a server somewhere in a train station in Russia (IIRC) that would sometimes spontaneously reboot...

Turns out, this always happened whenever a train with radioactive waste on it passed by, causing a few bits to flip in memory and a subsequent crash...

Can't seem to find the story online tho...



What a crazy and horrible story.

Best wishes to everyone who was ok with providing irradiated cows to the plebs.


Had a customer next to railway tracks who had some desktop computers freeze/reboot when a train passed by. No radioactivity fortunately, but apparently infrasound vibrations. They bought new computers with differently shaped towers, or laptops, and it was ok.


You remind me of a similar story, only failures would be a couple of times a year because of the positioning of the sun:

https://news.ycombinator.com/item?id=28688090

The first reply to that has a story that seems similar to the one you remember:

https://news.ycombinator.com/item?id=28689288


There must be a site with all of these stories but could not find it right now. The story about the car that would break down if you buy vanilla Ice Cream is one of my favorites. There’s also the story about the switch that is not connected anywhere but crashes the server every single time.



And, of course, the bug where OpenOffice wouldn't print on Tuesdays! [0]

(It turned out to be a bug in 'file', not in OOo.)

[0] https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161



> "When did this start? A few days ago, you said, but did anything change in your systems at that time?"

> "Well, the consultant came in and patched our server and rebooted it.

> Having established that--unbelievably--the problem as reported was true, and repeatable

As far as problems go, this is an easy one to solve. Accurate description of the problem. Accurate reporting of what changed. Problem is consistently repeatable.

Compared to a problem that doesn’t reliably reproduce and for which the person reporting it claims nothing changed, this one is child’s play.

But it’s still amusing to read every time.

I had something similar occur in the early 2000s on a patch release of Solaris 2.6 (I think) where the sleep call was broken and would always return almost instantly. This caused all sorts of weird behaviors on the running system.

I also recall the first time I ran into an issue with MTU on a dedicated frame relay link we had to admin our web farm in the late nineties. One day a developer reported they could log in to our bastion but when they ran “ls -l” in a big enough directory their ssh connection would hang. It turned out the connection would hang whenever a packet was generated near the MTU and we eventually tracked it down to an issue with the frame relay connection. We played with MTUs until we found out what worked. It then took a while to convince our provider what was going on but they eventually replaced a line card on the far end of the connection which allowed us to re-raise the MTU to 1500.

Another fun problem I had was an email to SMS gateway I wrote for myself that worked by posting to a Verizon web form. I developed it on a Mac (probably 2002 or so) where it worked fine but when I deployed it to my Linux box on my same home network, the script couldn’t connect to Verizon’s site. It turned out the Linux box was setting a flag on the TCP connection (ECN I think) that was tripping up Verizon’s web firewall. The work-around was disabling ECN on the Linux box.


> Compared to a problem that doesn’t reliably reproduce and for which the person reporting it claims nothing changed, this one is child’s play.

You forgot +inaccurate reporting of the problem.

We had an employee in IT at a client who would tell us "It's broken." This went on, for every report, for three years, with us asking the same follow-up questions every time. For who, in what way, when doing what, what changed, etc.

As far as I know, that individual still works there.

People like that are the best argument for why a basic income and removing some folks from the workforce would increase efficiency.


The point of basic income isn't to remove anyone from the workforce though; it's to decouple people's basic necessities from their work. They'll still want to work to get a more comfortable lifestyle. In fact it seems entirely plausible to me that the less people have to stress about basic necessities, the more they could focus on work.


Your statement converts to "no one will be satisfied with basic necessities."

Which is false, in my experience. Some people absolutely will be, and they won't work.


Not quite. I'm not saying that won't happen for anyone. I'm saying that's not the goal, nor would it be necessarily true for the kind of person mentioned.


Fair, but I think it's a good bet to say that people who aren't (for lack of a better word) "present" at their jobs would be perfectly fine not being there, if they weren't required to be.

And I know HN is fairly skewed towards people who want more from life, but there's a lot of people who just don't.

And I feel like they'd be over-represented in an opt-out-of-work cohort, if a basic income allowed for it.

I don't imagine a person like that goes into work saying "I'm going to do a bad job today." I think they just don't value the work they do enough to try to do it well. And they'd rather not, if that were an option.


In the mid-90s I used to repair PCs. Customer brought PC in where left mouse button did not function.

Easy. Replace mouse. NOPE.

OK, software issue. Reinstall mouse driver. NOPE.

OK, deeper software issue. Replace HDD from working PC. NOPE.

OK, replace RAM? NOPE.

OK, replace motherboard and all add-in cards. NOPE.

At this point we have a different HDD, motherboard, CPU, RAM, video card and mouse. Still left mouse button doesn't work. Mouse moves fine. Right button works.

Only thing left is the case and the PSU.

Replace PSU. Left mouse button works perfectly.

FML.


This is just such a classic story describing the Murphy’s Law of working with computers. This gave me a chuckle.

Solving programming problems is sometimes similar.


Wha? Was it a PS/2 mouse? x86 system? More details please.


Serial mouse IIRC. x86 system. Probably a 386.


This reminds me of a problem I'm currently having. My iPhone freezes completely sometimes when I ride BART, requiring a hard reboot. I notice it happens when passing the Daly City station. It seems there's a tower nearby whose strong signal causes the problem. It's probably the strength level read by the hardware causing an out-of-bounds error somewhere and corrupting the phone's memory.


Do you have an IMSI catcher [0] detection app on your phone? I used to have the same issue (EU country) on the metro. There was one single stop which was above ground and near an international conference centre. Every time I went through that station my phone would lock up. It needed a hard reboot (remove the battery), until I removed the IMSI catcher detection software. After that I kept the phone in flight mode when using that metro line.

Edit: rooted / android / HTC phone. [0]: https://en.wikipedia.org/wiki/IMSI-catcher


I don’t have an IMSI catcher detection app running. Good to hear a similar case happens, and it's not just my phone going crazy. And it’s signal related.


Does this happen to anyone else? There aren't a ton of iPhone variants out there, so if it's a baseband-level defect, it would be happening a lot.

I'll also say, if it's just the signal strength being too high, it seems unlikely that would cause memory corruption. The signal strength is probably just an integer, and there aren't any operations defined on integers that involve using other bytes of memory. (If you have a uint8 and add 1 to 255, you just get 0; it doesn't upgrade the integer to a uint16 and overwrite adjacent memory.)
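
To make the wraparound point above concrete, here is a minimal Python sketch. Python integers do not overflow on their own, so the 8-bit register is simulated with a mask; `add_uint8` is a made-up helper name for illustration, not anything from a phone's baseband firmware.

```python
def add_uint8(a: int, b: int) -> int:
    """Add two values as an 8-bit unsigned register would (modulo 256)."""
    # Masking with 0xFF keeps only the low 8 bits, which is what fixed-width
    # unsigned arithmetic does; no neighbouring byte is touched.
    return (a + b) & 0xFF


if __name__ == "__main__":
    print(add_uint8(255, 1))    # 0: the value wraps around
    print(add_uint8(200, 100))  # 44: 300 modulo 256
```

A wrapped value can still cause trouble if later code uses it as an index or a length, but that would be a logic error downstream rather than the addition itself overwriting memory.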


It’s an iPhone XS model. Occasionally I would get a “no carrier detected” error at the location while the signal bar is full. Looks like it’s signal related, and the bug is somewhere in the error- or exception-handling path.

Also does iPhone use ECC memory for all chips? Another possibility is my phone’s EM shielding is not good, and a strong EM/microwave signal scrambles the memory in one of the chips, probably the signal receiving chip. The freeze only happens occasionally.


Probably just someone broadcasting a new 0-day


At a very large bank here in Australia & NZ, all XML messages going through the main message bus had a trailing space character appended to the end, which broke XML validation on the receiving endpoint.

So the solution was for all endpoints to trim the very last character - not just if it was a space, but to chop off the last character. Apparently this had been the solution for years.

This worked really well until one day someone (probably a new grad) saw the character issue and figured they'd fix it.

A bank-wide P1 incident occurred because every single XML message was now unparsable due to the malformed closing '</xml ' tag. Every single application in the bank had to do an emergency update on its XML parser.


Isn't XML with trailing whitespace still valid XML?


Not if your validation is `/</xml>$/`.
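
The failure mode in this sub-thread can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample messages, the `<xml>` root element, and the helper names are invented here, and the regex is the one quoted in the comment above, not the bank's actual validator.

```python
import re

# The strict check quoted above: the message must end exactly with "</xml>".
CLOSING_TAG = re.compile(r"</xml>$")


def is_valid(message: str) -> bool:
    return CLOSING_TAG.search(message) is not None


def chop_last_char(message: str) -> str:
    # The long-standing workaround: drop the last character unconditionally.
    return message[:-1]


def strip_trailing_space(message: str) -> str:
    # A safer variant: only remove trailing whitespace if it is there.
    return message.rstrip()


# Hypothetical messages, before and after the upstream "fix".
old_bus = "<xml><amount>10</amount></xml> "
fixed_bus = "<xml><amount>10</amount></xml>"

if __name__ == "__main__":
    print(is_valid(old_bus))                          # False: trailing space
    print(is_valid(chop_last_char(old_bus)))          # True: workaround helps
    print(is_valid(chop_last_char(fixed_bus)))        # False: closing ">" lost
    print(is_valid(strip_trailing_space(fixed_bus)))  # True
```

The unconditional chop only works while the upstream bug exists, which is why removing the trailing space at the source broke every consumer at once.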


Why didn't they just rollback the fix instead?


What do you mean "rollback". You say this like this bank had version control and it wasn't just recently introduced.


How do you rollback typos in messages you've already sent?


D'oh! How did they not think of that



Past related threads (less than I expected given how often it has been reposted):

We can't send email more than 500 miles (2002) - https://news.ycombinator.com/item?id=23775404 - July 2020 (135 comments) (<-- thanks ayewo for finding this!)

The case of the 500-mile email (2002) - https://news.ycombinator.com/item?id=14676835 - July 2017 (56 comments)

Every time we lift a pallet from the shipping room, the server times out (2006) - https://news.ycombinator.com/item?id=13347058 - Jan 2017 (82 comments)

The case of the 500-mile email - https://news.ycombinator.com/item?id=10305377 - Sept 2015 (1 comment)

The 500-mile email (2002) - https://news.ycombinator.com/item?id=9338708 - April 2015 (139 comments)

The case of the 500-mile email - https://news.ycombinator.com/item?id=1293652 - April 2010 (24 comments)

The case of the 500-mile email - https://news.ycombinator.com/item?id=385068 - Dec 2008 (28 comments)

The case of the 500-mile email - https://news.ycombinator.com/item?id=123489 - Feb 2008 (7 comments)


Another story (don't know if there is a nice write up) is "OpenOffice can't print on Tuesdays": https://bugs.launchpad.net/ubuntu/+source/file/+bug/248619

Edit: Here is one: https://bugs.launchpad.net/ubuntu/+source/cupsys/+bug/255161...


I know this gets posted quite often, but I still enjoy reading it every time.


Agree, I will always upvote it because it’s a good story for those who haven’t read it, and it begets really good comments and other war stories from seasoned devs.


What baffles me is that nobody seems to have called out the fact that the story is utter nonsense.

I posted a longer explanation in a lower level comment, but the biggest problem is that the author has a fundamental misunderstanding of both the speed of propagation, and how TCP works.

There is no way any of what he wrote actually happened, including "sendmail defaults to a few ms timeout."


All of this is addressed in the FAQ about the story: https://www.ibiblio.org/harris/500milemail-faq.html


I can believe there was a bizarre bug setting the timeout to zero and that there was a small delay handling the timeout leading to only quite low latency connections working. I don't for a second believe that the author didn't intentionally dream up a large chunk of this story. The FAQ claims that it wasn't actually 3 milliseconds and that he came up with this time by calculating the "adjusted" time for a given distance. But then he also claims that he took the timeout and used units to confirm the 500 miles. This is a blatant tautology, the speed of light had very little to do with this, the "epiphany" moment was definitely made up. Then there's the handwaving away the problem that sendmail V5 wouldn't have worked at all with a V8 config file. "Oh Sun must have patched their version to run like that".

That whole FAQ reads very much as "I didn't lie, I just made up all the details". I'm less convinced of the veracity of the story after reading his explanations of all the holes in his story.


The original author offered additional explanations on HN when the article was re-posted in 2015: https://news.ycombinator.com/threads?id=TreyHarris


You somehow forgot this one that made a showing in 2020:

We can't send email more than 500 miles (2002) - https://news.ycombinator.com/item?id=23775404 - July 2020 (136 comments)


Added above—thanks! No idea how that escaped my search (which was something like https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...)


I would enjoy seeing a "greatest hits" list of pages that are repeatedly submitted and discussed here.


I think I read about this for the first time on a dialup BBS in the 90s :)


I tried to read about it on that BBS, but it was more than 80 miles from my house and I couldn't afford the long distance phone charges.


You obviously didn’t have a Novation Apple Cat modem that could wardial random 5 digit calling card codes for 3rd party long distance carriers.

I really hope there are a few other people here who know what that sentence means or I’m officially ancient :(


I almost feel younger people will not understand what you are referencing :) I was the SysOp of a FidoNet node.


A classic and wonderful piece of internet lore. If I ever have kids this is one of the ones I'll be telling around the campfire. The one about the internet going down because the delivery truck blocked LoS is a good one too.


Could you please link to the truck story? I can’t find it.


No matter how many times this gets posted, I make sure to read it. Such a good story, especially the ending using oft unused unix tools


It's a lot like the SR-71 speed check story for me.


While that's great I like the low-flyby-almost-crash a bit better.


Could you build this into a protocol?

Like an ssh setting that only allows incoming connections that can prove (well, suggest) their proximity by a series of latency tests?


You could but it's much easier to get servers in a near-by area with AWS and other easy virtual hosting providers.


There's no reason you couldn't, but distance is not the only source of latency, so you're unlikely to find an existing case of someone doing that intentionally.

Easy enough to whitelist geo-ip matches or large net block ranges for a similar result.


You can set a TTL limit in the kernel. Close enough to a latency limit.


TTL checks are sometimes used by mobile providers to try to prevent tethering.


that is so foul


That's actually used for neighbor discovery on IPv6 to unconditionally limit it to unrouted packets.
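
A per-socket version of the TTL idea can be sketched in Python. This only limits how many router hops outgoing packets may cross, which is a rough proxy for network distance and not a latency measurement. `make_short_range_socket` and the hop count of 3 are illustrative assumptions; the system-wide default on Linux is the `net.ipv4.ip_default_ttl` sysctl.

```python
import socket


def make_short_range_socket(max_hops: int) -> socket.socket:
    """Create a TCP socket whose outgoing packets expire after max_hops routers."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # IP_TTL sets the Time To Live field on packets sent from this socket.
    # Each router decrements it and drops the packet when it reaches zero.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, max_hops)
    return sock


if __name__ == "__main__":
    s = make_short_range_socket(3)  # 3 hops is an arbitrary example value
    print(s.getsockopt(socket.IPPROTO_IP, socket.IP_TTL))
    s.close()
```

With a low TTL on the server side, replies to far-away clients should expire in transit, so their handshakes never complete. Hop count and distance are only loosely related, so this is closer to "nearby in the network" than "nearby on the map".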


I think DRM would be a much better use case than security (SSH).


  $ units
  586 units, 56 prefixes
  ...
Can someone share/point to a larger units file I could load? The one in the post has 1311 units, 63 prefixes. Notably, I do not have millilightseconds.



This file is actually interesting to look at. It almost reads like a history book about measurements.


You could check which one Fedora uses. It has millilightseconds.


See also: https://500mile.email/ - curated list of absurd software bug stories


It happened to me, too. Back in 1995 I was in charge of the Sparc server that handled email. I got a call telling me that we couldn't send mail outside Spain. Back then, we had a slow internet connection (128K if I remember well) and sometimes the academic network had issues speaking with the outside world. Two days later we had more complaints. This time it couldn't be a network issue. We had the same problem: one OS upgrade made sendmail use a default config, not ours. Fortunately mail didn't bounce, and after the fix the server was above 20 load average for two days.

Good news was no spam came that week.


Less of a crazy bug than a funny one: I had a friend named Peter March. When his pay check fell on April 1st and was made out to Peter April he obviously thought it was an April Fools joke. It wasn't.


I worked for a company where a proxy server was used for all internet access. Every now and then a page would not load. Error logs pointed me to the usual culprit: DNS. When looking at DNS traffic in tcpdump everything looked normal, except some DNS replies came from RFC 1918 addresses instead of from the DNS server's public IP address. When I talked to the ISP, they blamed me (the proxy) for reusing UDP sockets, and said it was by design that their load balancer would only support one DNS request at a time per UDP socket. If there were two or more in flight, only the first response would be NATed properly. Luckily, I knew that the ISP used the same proxy product internally, so when I asked how they configured their proxy to avoid this issue, they fixed the load balancer within the hour.


In a similar vein: sysadmin war story "the network ate my font".

http://verticalsysadmin.com/blog/sysadmin-war-story-the-netw...


If you like reading such debugging stories, check out danluu's repository: https://github.com/danluu/debugging-stories


We have a banking website which refuses to log me in when I connect on the 5G WiFi but allows me to log in when I connect on the regular 802.11 WiFi (non-5G mode). How does the website login know which WiFi speed I am connecting on?


Could it be that one of the networks assigns IPv4 or IPv6 and the other doesn't, and you therefore end up hitting different IPs?


Perhaps if they try to call JavaScript functions that are not available yet… I would check the developer console for errors


Back in the day, I had a Nokia E71 smartphone that I used to keep next to my work provided laptop.

My laptop would freeze for a couple of seconds right before each incoming call. Every single time.

It wasn't all that baffling to me so I decided to test the thing while placing the phone on top of my huge desktop tower. My overclocked computer simply restarted itself instead of freezing. Props to Lenovo, I guess.


Anyone experienced an old VisualStudio (was it with VC6 still) bug, where the compiler would flag the last line of the source file as an error, when it did not terminate with a CR/LF newline? All code would clearly look correct in the editor, yet could not be built.


I don’t remember a bug around that behavior, but flagging a missing EOL as an error is correct.

The C standard says all source files must end in a newline.


In all the best ways possible, this reads like an Asimov short story.

Sigh, I miss him so much...


While an extremely interesting read, I've read this about once a month on HN. The story keeps getting posted.


TIL: `units` TILold: `man units`


And that's why, kids, we invented configuration management tools.


This story never gets old and I hope to see it again.


A lot of this story and why it sticks is the narration. This sort of storytelling is rare amongst technical people but certainly something to aspire to.


It's a cute story but one that is utter nonsense.

Anyone who is above junior level system or network administration should be able to instantly tell, on point #2 alone. If they don't, they do not understand the basics of TCP.

1) Wave propagation through copper wiring is a bit more than half the speed of light. Strike one.

2) His mail server sends a SYN which is acknowledged by a SYN-ACK back. His mail server does not know instantaneously that the other mail server is accepting the connection; it has to wait for a reply back. A mail server that is '3ms away', even assuming speed of light transmission and no switching/routing/transmission delay, would be ~250 miles away, not ~500.

3) A mail server does not instantaneously reply to a SYN packet, so there's delay there. Strike three.

4) There are multiple routers involved (at least two) and especially circa 2002 routing and switching probably accounts for a significant amount of delay. Routers and switches tend to be store-and-forward; they receive the entire packet, an interrupt is generated by the NIC, a routing decision is made, the packet is shoved into the buffer of the outgoing NIC, etc. That doesn't even cover firewalling. 3ms delay through multiple circa-2002 routers, firewalls, and switches is not believable.

5) Transmission delay. Packets take longer than just "distance divided by the speed of light" to arrive somewhere. Given typical LAN/WAN connection speeds of the day, packet transmission delay would be a factor, i.e. the time it takes for a complete packet's bits to be transmitted.

6) Jitter. Even the slightest jitter would have wildly affected his testing. Just 2-3ms of jitter (if we follow this guy's calculations) would result in undeliverable mail to almost anywhere, yet he claims mail within the radius was reliably delivered. The campus network (particularly inter-building links), internet uplink, backbones, other networks, and other mail server all have jitter that combined would easily exceed 2-3ms.
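
The arithmetic behind points 1, 2 and 5 can be checked with a short Python sketch. The function names, the 0.66 velocity factor, and the T1 link rate of 1.544 Mbit/s are assumptions picked for illustration; none of them come from the original story.

```python
C_MILES_PER_SEC = 186_282.397  # speed of light in vacuum, in miles per second


def reach_miles(timeout_ms: float, velocity_factor: float = 1.0,
                round_trip: bool = True) -> float:
    """Farthest host reachable within timeout_ms, ignoring all processing delay."""
    distance = C_MILES_PER_SEC * velocity_factor * timeout_ms / 1000.0
    # A connect() needs the SYN to go out and the SYN-ACK to come back,
    # so the usable distance is half of what the signal covers.
    return distance / 2 if round_trip else distance


def serialization_delay_ms(packet_bytes: int, link_bits_per_sec: float) -> float:
    """Time needed just to clock a packet's bits onto a link."""
    return packet_bytes * 8 / link_bits_per_sec * 1000.0


if __name__ == "__main__":
    print(round(reach_miles(3, round_trip=False)))      # about 559, one way in vacuum
    print(round(reach_miles(3)))                        # about 279, round trip in vacuum
    print(round(reach_miles(3, velocity_factor=0.66)))  # about 184, assumed cable speed
    print(round(serialization_delay_ms(1500, 1_544_000), 2))  # about 7.77 ms on a T1
```

Under these assumptions a 3 ms budget covers a few hundred miles of round trip at best, and a single full-size packet on a slow link can use up more than the whole budget on its own.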

I also find it extremely difficult to believe that sendmail on SunOS defaults to a 3ms timeout and lacks a sane default when not specified; 3ms isn't remotely sane even today, much less 2002. Anything less than 30 seconds would surprise me. A lot of old software had very long timeout values because of how common slow links and systems were.

This looks like a viral story told by an unemployed sysadmin to get his "I'm looking for a job" message out.

If he truly had the knowledge commensurate with 10+ years as a sysadmin, he knew it was bullshit, and I think he might have been very cleverly looking for places where technical knowledge was lacking and thus he'd either be the smartest guy in the room or have an easy time of things.


If you haven't already, read the accompanying FAQ: https://www.ibiblio.org/harris/500milemail-faq.html


The only takeaway I get from that FAQ is that he's annoyed people think it's fake despite the fact that he admits he made up almost every aspect of the story.

It's an entertaining story without changing numbers to bogus ones. So why did he basically lie about everything? Because it never happened and when he concocted the story he wasn't very smart.

Notice he doesn't admit that the sole purpose was to spam "I need a job" to the mailing list?


> Routers and switches tend to be store-and-forward; they receive the entire packet, an interrupt is generated by the NIC, a routing decision is made, the packet is shoved into the buffer of the outgoing NIC, etc. That doesn't even cover firewalling. 3ms delay through multiple circa-2002 routers, firewalls, and switches is not believable.

This actually hasn't been true for high-end routers since the late 90's.

On a Juniper M or MX series device, for instance, a packet is broken up into a "parcel" (the first 256 bytes of an IP packet) which is saved in shared memory. Further packet processing is done by dividing up parcel lookups to the routing table across different linecard processors (depending on the linecard, this can differ wildly).

This setup has been in place since the late 90's, although not on "lower end" hardware. As far as I am aware, Cisco has had a similar mechanism in their higher-end products for at least a decade or two.


Seen this a few times but I'll leave you with this:

https://xkcd.com/1053/



