My ISP Is Killing My Idle SSH Sessions (anderstrier.dk)
403 points by anderstrier 15 days ago | 240 comments



This happens because there's NAT (network address translation) happening somewhere.

Without NAT the only 2 parties that need to know anything about a TCP connection are client and server.

With NAT you have this problem where the router now also has to keep track of opened TCP connections.

E.g. if you have a router with local IP 10.0.0.1 and external IP 30.0.0.1, and you are 10.0.0.2:55000 connecting to 230.0.0.1:443, the router will have to allocate a port on its external interface (let's say 56000) and remember it (this is the key part). So the connection will look like this:

10.0.0.2:55000 <-> NATing router 10.0.0.1 - 30.0.0.1:56000 <-> 230.0.0.1:443

When the router receives packets on 30.0.0.1:56000 it has to remember to redirect them to 10.0.0.2:55000.

Memory is a limited resource so you can't just have an unlimited number of these open connections floating around. This also makes your router vulnerable to an attack where an attacker can just open a bunch of connections and never close them, making your router eventually run out of memory.

So the classic solution to this problem is to use an LRU cache: when your router is close to running out of space, you just drop the connection that has been idle the longest.

Unfortunately, a) some routers are less sophisticated and will still drop your connections even if you do keep-alives and such, b) no matter what you do, memory is a finite resource and if the router doesn't have a lot of RAM, connections will be dropped.

¯\_(ツ)_/¯
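For what it's worth, if the NAT box happens to be Linux-based (as most home routers are), you can peek at that table yourself. A rough sketch, assuming netfilter connection tracking and the conntrack-tools package are available:

    # how many translations are currently tracked, and the hard ceiling
    sysctl net.netfilter.nf_conntrack_count
    sysctl net.netfilter.nf_conntrack_max
    # dump the current entries (addresses, state, remaining timeout)
    conntrack -L | head

When nf_conntrack_count approaches nf_conntrack_max, something has to be evicted or new connections start failing.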


It's not just NATs that cause this. Stateful firewalls must also keep track of connections to allow the responses for outbound requests that would not otherwise be allowed into the network. E.g. when you make a request to

www.example.com:443

from source port 12345, and you or your ISP has a firewall that blocks everything that isn't explicitly allowed (this is common in corporate networks), the response could be allowed using firewall rules such as

    iptables -A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

This has the benefit of being general, but the drawback is that the firewall now needs to track the connection, with similar consequences to the NAT example you have.

It's also more likely for the firewalls to time out connections rather than use some kind of LRU scheme. In my opinion the time-based eviction is more predictable, so I prefer it. (Of course once you run out of memory you still need to evict "live" connections)
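On Linux/netfilter those time-based evictions are per-state timeouts you can inspect and tune. A small sketch (the defaults depend on your kernel; 7440 seconds is just the 2 hours 4 minutes mentioned further down the thread, not a recommendation):

    # idle timeout for established TCP connections, in seconds
    sysctl net.netfilter.nf_conntrack_tcp_timeout_established
    # idle timeout for UDP flows
    sysctl net.netfilter.nf_conntrack_udp_timeout
    # raise the TCP one on a middlebox you control
    sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=7440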


Indeed. It's fairly common to mix up stateful firewalls with NAT. You can have a stateful firewall without NAT, but you can't have NAT without a firewall. It's actually the firewall that is keeping track of connections.

The big difference here, though, is carrier-grade NAT. That means the firewall is not under your control and might have a tiny state table. NAT is bad enough as it is, but CGN should never have happened. It's just depressing to think about, to be honest.

Even with IPv6 many ISPs are still doing it wrong. They'll give subscribers dynamic prefixes which means having to use unique local addresses (ULAs) in addition to their Internet routable addresses because the latter keep changing. This kind of stupidity makes people at home want to hang on to their IPv4 LANs because they seem more under their control.

If only I could get an ISP like Hurricane Electric to provide me with a DSL line at home for a reasonable price. Consumer-grade ones are all hopelessly bad.


> but you can't have NAT without a firewall

While it is true that most NAT arrangements are provided by firewalls, it is quite possible for a device to provide NAT with no other firewalling features at all, and so not be considered a firewall. In this case the device would just be a router that provides NAT.

Some confuse NAT and firewalling because NAT effectively implements a default-deny-all-not-initiated-here rule in one direction which is what most home users want in a firewall.


"Some confuse NAT and firewalling because NAT effectively implements a default-deny-all-not-initiated-here rule in one direction which is what most home users want in a firewall."

To make it even more confusing, what most people are confusing with firewalling is actually NAPT, which is the specific type of NAT described in this thread. There are other types of NAT which don't require keeping track of state and which don't provide the default-deny-all-not-initiated-here rule side benefit.


> what most people are confusing with firewalling is actually NAPT

Yes. I should be clearer myself as just referring to NAT this way could serve to increase the confusion.

What most people just call NAT, what is offered by simple home/office routers (or APs when not in bridge mode or similar) and phones in tethered wireless mode, is actually NAPT (Network Address Port Translation), which is a subset of SNAT (Source Network Address Translation), which is in turn a subset of NAT.
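For the iptables-minded, a rough illustration of the distinction (interface names and addresses are made up; not a complete config):

    # NAPT / masquerading: rewrite source address and port, track every flow
    iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
    # SNAT to a fixed address: still stateful, ports rewritten as needed
    iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to-source 203.0.113.5
    # 1:1 prefix mapping with NETMAP (iptables still runs this through
    # conntrack; truly stateless NAT needs other tooling)
    iptables -t nat -A PREROUTING -d 203.0.113.0/24 -j NETMAP --to 10.0.0.0/24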


Indeed. A misconfigured NAT setup can also result in some traffic being NAT'd correctly and other traffic not being NAT'd, but ultimately still leaking out onto the wire (in either direction)

Beware when you're doing pure NAT, it doesn't always do what you think!


Isn't the firewall config you describe just essentially a software NAT?


No, not at all. NAT means Network (and port) Address Translation. If you don't change the contents of packets, it's not NAT.


> If you don't change the contents of packets, it's not NAT.

Oh well, yes. I agree.

I was thinking more from the point of view of the behavior it causes: essentially establishing some sort of look-up table to verify if an incoming packet corresponds to a previously outgoing one. Where, if the entry in that table gets deleted, the incoming packets suddenly start being rejected.


AT&T's home gateways have a maximum NAT translation table of 1024^H^H^H^H8192 connections. Some websites will go past that. A torrent client almost certainly will. And, now that people are working from home, there's a good chance that having multiple computers will only make that 1024 table limit even more laughable.

EDIT: okay I'm wrong. It's 8192 connections, not 1024 connections. But still ridiculously low


Just as a FYI/aside, it is fairly trivial to root AT&T home gateways, pull the certs and use your own hardware to authenticate to the network, removing their hardware from your stack entirely except for the ONT (goodbye internet downtime due to random uncontrolled gateway "upgrades"). You just need a router capable of 802.1x client auth.

Throughput both ways actually gets really close to what I am paying for with this configuration, whereas before with the default gateway (regardless of configuration), I was lucky to see half of the gigabit speeds I have been paying for.


I have such AT&T hardware also, but you and I have very different ideas about what's trivial.

I didn't know their box even had certs, or what "ONT" is. Is there like... a written series of steps I could follow?


If you are willing to move to Ubiquiti hardware (recommended, security breach from today notwithstanding) there's a relatively straightforward bypass method where the authentication packets are forwarded from the ONT to the AT&T box but it's otherwise out of the loop, and you have fully native routing with the Ubiquiti USG (a really nice router and ecosystem).

Instructions: https://medium.com/@mrtcve/at-t-gigabit-fiber-modem-bypass-u... Github project that makes it possible: https://github.com/jaysoffian/eap_proxy

It's definitely not plug and play but I've been using this setup for a year and a half and I get my full 1gb bandwidth throughout my network with lots of hosts.


AT&T has started using a much newer gateway for new installations.


Damn, that's a serious bummer. I hope mine doesn't break anytime soon.

If you have the BGW210 gateway there is a written series of steps for root here: https://github.com/Archerious/bgw210-root As well as step by step configuration for complete gateway bypass on Mikrotik router hardware here: https://forum.mikrotik.com/viewtopic.php?t=154954

If you are stuck with the newer XG-PON hardware, it looks like you might be out of luck for now.


This is true for existing installs. But recently ATT moved to XGPON gateways with integrated ONT. You can no longer bypass these gateways. Also to my knowledge you can’t extract the certs from Pace gateways.


And, these gateways use NAT even when in "bridged mode"


You can request to go into bridge mode which will bypass the internal residential gateway (NAT).


If an ISP is NAT'ing everyone (which I've heard of referred to as an "InterNAT Service Provider"), does "bridge mode" mean you get a real public IP? How does that work with everyone else still behind the NAT?

(I have an actual end-to-end-connectable public IP from my ISP, which from the general discussion seems like an increasingly rare thing --- they keep pestering me to "upgrade" to outrageously faster yet slightly cheaper plans with a "free router included", so I suspect they are trying to get me to give up that IP...)


There are 2 different topics here. One is carrier grade NAT (CGNAT), which is used by ISPs that have run out of IPv4 addresses, so you don't get a real public IPv4 address, although you should have a public IPv6. If you're unlucky enough to be on one of these ISPs there's likely not much you can do.

The other issue is ISP-provided gateways that handle authentication onto the ISP network, like ATT fiber. These devices contain the certificate/keys to gain access to the network. Unfortunately these devices also try to be more than just an auth device/gateway. In ATT's case the gateway also handles some Uverse/IP TV services, so they don't have a true bridge mode where they send all traffic to another device. This approach then causes issues like update downtime or NAT table issues.

Neither of these issues is caused simply by an ISP-provided router. If an ISP wants to implement either approach they will do so without your approval.


> carrier grade NAT (CGNAT), which is used by ISPs that have run out of IPv4 addresses … If you're unlucky enough to be on one of these ISPs there's likely not much you can do.

I had the same SSH dropout problem, asked my ISP[1] to switch me from CGNAT to dedicated IPv4; they did, and it's fixed.

[1] Aussie Broadband, a smaller ISP in Australia renowned for good customer service.


Consider sending Aussie Broadband a link to my blog post. It should be a simple fix for them to raise the timeout, which should fix the problem for all their customers.


> One is carrier grade NAT (CGNAT), which is used by ISPs that have run out of IPv4 addresses, so you don't get a real public IPv4 address, although you should have a public IPv6. If you're unlucky enough to be on one of these ISPs there's likely not much you can do.

This is true. Your options look like:

1. Get a new ISP

2. Get a VPN that supplies you with a public IP (these exist)

3. Hope you can do whatever you need on IPv6 instead


Some CGNAT ISPs will also sell service with a public IPv4 for a premium. That's probably the most "user-friendly" option but it's also probably something they don't advertise and you need to ask for explicitly, if offered.


you can still get around this with some effort [1] and a pfsense box, the pfsense box gets wan from the ont and the original att router is hung off a third nic where it's allowed to do 802.1x and nothing else. the setup was a little challenging at first but has been maintenance free since. maybe there is a technical reason they have their network set up this way but i was offended at the idea of being prevented from using my own router.

[1] https://github.com/MonkWho/pfatt


The AT&T gateways do not have a true bridge mode. They still use NAT even if they look like they are just passing the connection on.


It's even more trivial with CenturyLink's fiber. You don't even need any certs.


1k certainly seems absurdly small considering how much RAM routers likely have, the fact that they can use most of it, and the amount of data needed for a single connection table entry (2 bytes external port, 2 bytes internal port, 4 bytes internal IP adds up to 8 bytes per entry; even being very generous at 16 bytes including overhead, 1024 entries is still only 16K) --- on a device that likely has several MB if not more, and whose primary function is likely NAT.


Some providers do this to force you to upgrade to business plans. Comcast Business though, at least a while back, still had a limit too low for the office I worked at. We switched to ATT business fiber and used our own GW.


> Some websites will go past that.

Do you literally mean a website? Using a browser? What’s an example website that would go past that?


You can overwhelm a NAT in several ways.

UDP is connectionless, but typically a UDP communication is bidirectional. This means a NAT needs to inspect UDP packets and retain a mapping to direct incoming UDP packets to the right place. With no connection information this can only be done as an LRU cache or similar.

TCP is connection oriented, and a NAT might rapidly free up resources when a connection is closed (ie, when the final FIN has been ACKed). But if there's no FIN, the NAT is in the same case as it is with UDP. Making a lot of connections without closing them fills up NAT buffers.

When you have a home NAT and a carrier-grade NAT you may get an impedance mismatch of sorts. The CGNAT might have insufficient ports allocated to your service to keep up with your home NAT, resulting in timeouts or dropped mappings. Your home NAT will have one set of mappings and the CGNAT another, and the two sets probably won't be exactly the same. This means some portion of the mappings held in memory are useless.

As a specific example, many years ago Google Maps would routinely trigger failures. Using Maps would load many tile images, which could overwhelm a NAT or CGNAT. The result was a map with holes in it where some tiles failed to load.

Browsers have long had limits on concurrent connections per domain. Total concurrent connection limits are also old news, but are not quite as old as per-domain limits. You probably can't make a NAT choke with just simple web requests (even AJAX) any more. You might be able to do it using, eg, WebRTC APIs, though I would be surprised if those aren't also subject to limits.
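If you're curious how many mappings one busy machine is occupying right now, something like this gives a ballpark (Linux, iproute2's ss; the conntrack line only works where the module is loaded):

    # established TCP flows, each of which needs a translation entry upstream
    ss -tn state established | tail -n +2 | wc -l
    # tracked UDP flows
    conntrack -L -p udp 2>/dev/null | wc -l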


I remember being able to overwhelm my first "home router" with the "Browse for servers" tab in Counter Strike 1.6! It would fetch a list of all servers from Steam, and then connect to them individually, eventually killing my router.


"Using Maps would load many tiles images, which could overwhelm NAT of CGNAT."

Just curious, were these image resources all hosted on the same domain?


No, and that's by design, as many browsers limit you to two HTTP connections per domain. When you're loading tens of images (like map tiles), you want to use as many different subdomains as possible to load them in parallel.


With HTTP/2 one is way better off using one connection to one domain instead.

How has the world changed.


For many years before "HTTP/2", I have been using HTTP/1.1 pipelining, outside the browser, to download hundreds of files over a single TCP connection.


I'm afraid I don't recall. I suspect that they could not have been, based on best practices for performance at the time and the fact that the problem existed at all. I did, however, find a reference to the problem:

https://meetings.apnic.net/32/pdf/Miyakawa-APNIC-KEYNOTE-IPv...

Slides 12-15 show the degradation of Maps in action. 20 connections per user is a heavily over-committed CGNAT, but that level of port sharing does happen.


But the "connections per user" limit is per-webserver. You'd have to have thousands of users simultaneously loading maps off the same google server just to run out of ports on one IP.

I bet you could put 10k people behind each IP and never even get close to an issue of this type.


Carrier grade NAT puts thousands of users behind the same IP address - that's what it's for.

You can't put 10k people behind each IP and not have problems. That's 6.5 ports per person, you need one for each connection. Pretty much any website will have issues with that little connectivity.


> Carrier grade NAT puts thousands of users behind the same IP address - that's what it's for.

It doesn't have to. 100:1 would work just fine. With IPs being about $25 each that's an acquisition cost of less than a dollar per user.

> That's 6.5 ports per person, you need one for each connection.

That's not how connections work. Each user could make a million connections as long as they're spread around different servers. The 65k limit applies to simultaneous connections to a single webserver. Only the most-connected server matters, so probably something at google/youtube/facebook, and even then most of those servers have multiple IPs.


wouldn’t websockets be impacted by this limit?


Yes but I've yet to see a website use more than 10 simultaneous websocket connections, let alone 1000.


There's something like a 256 count limit on total websockets, and 30 per domain, in Chromium.

A malicious website could open up 256 websockets and as many HTTP connections as the browser allows, and that might be enough to swamp cheaper NATs.

See https://bugs.chromium.org/p/chromium/issues/detail?id=12066 for some 2009 discussion about people having troubles using the web when background tabs held connections open for polling. That wasn't a NAT issue, but it does highlight that a decade or two ago we all thought we only needed to manage tens of connections for a host to be online but that rapidly spiralled into hundreds.


I know it's not 2002 anymore, but I'm pretty sure no website on this planet would even come close to 1000 open connections, unless it actively tries to achieve just that, but even then I think browsers still have a limit on number of concurrent open connections, per tab and maybe total.


I also was very surprised about that number, so I checked with tcpdump and google maps on a new browser instance: I count just 31 syns after zooming in, moving and clicking on a pub :?



I'm pretty sure this guy is not using anything from AT&T though, as the chances seem really good he's in Denmark.


I just want to say thank you, this was a very concise explanation of a very complex concept.

I've been working with NATs for years and your comment helped me "click" and understand them at a different level.


Similarly, I found that OP's article provided an excellent primer on many concepts -- it certainly clarified the relationship between NAT and firewalls: that is, the latter being somewhat of an unintended consequence of the former.

Stumbling upon a great blog post that makes something click is always a pleasant experience.


Thanks! I guess I should keep blogging then :)


Me too. I’ve always wondered how a NAT knows where to route traffic. I figured it would use a lookup table, but I never knew what the “keys” were. For some reason, using different ports for each device behind the NAT never crossed my mind! I knew it couldn’t be done by adding routing data to the packets (which is what IPv6 ended up doing) because that isn’t sustainable over multiple NATs. Port-based routing with a table makes so much sense! It also explains why idle sessions are dropped.


"Without NAT the only 2 parties that need to know anything about a TCP connection are client and server."

Even without NAT there may be multiple devices between a client and server which need to know about the TCP connection. Stateful firewalls, WAN accelerators, and load balancers are some examples.


They should all be under your control, though. Once you hand off to your ISP it should be nothing but routers all the way.


I wish they were all under my control.


as noted ITT:

set ServerAliveInterval in your ssh_config to avoid this (default unset)
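For example, a minimal ~/.ssh/config stanza (60 seconds is just a safe guess, anything comfortably below the middlebox's idle timeout works):

    Host *
        # application-level keepalive after 60s of inactivity
        ServerAliveInterval 60
        # declare the connection dead after 3 unanswered probes (OpenSSH default)
        ServerAliveCountMax 3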


[mosh or eternal terminal] + tmux combo solves the problem (and adds other cool functionality)

https://mosh.org/

https://eternalterminal.dev/

https://github.com/tmux/tmux/wiki


I hate tmux because the way it handles switching windows is terrible. The model is switching between windows, each of which can have some layout of panes. This is approximately never what I want for a long-running thing I can detach from and reattach to, from a different terminal with different connections. Further, it supports multiple attachments by having each of them see the exact same thing. For terminals that are different sizes, it shoves everything into an area the minimum height and width of all attached clients, and displays everything outside that as a field of periods.

I really prefer screen's functionality, instead. Each pane can be independently switched to any of the pool of underlying ptys being managed. The exact layout of panes is more of an ephemeral thing. On wide terminals I can split side-by-side, on narrow ones I can split top-above-bottom (and these can even be different for different clients connected to the server). It handles the multiple attached clients by only make

I think tmux's model is due to being too closely modeled on GUI virtual desktops.

That rant aside, tmux does have some nice features available by default that take a fair bit of configuration to get working well in screen.


tmux supports multiple sessions and switching between those. sessions behave independently like your screen windows. you can of course have multiple windows in each session, which then behave the tmux way. you can also have the same window show up in multiple sessions and use that to emulate screen behavior. (i saw a script to help with that some time ago) for me, multiple sessions alone provide enough flexibility.


It probably technically does, in a horrifying Turing-tarpit way.

I actually did try to get the behavior I wanted out of nested tmux invocations and a hairy mess of scripts. There was no 'next-session' command (though now it looks like 'switch-client -l/-n/-p' would work. Hmm.)

I couldn't get it to work right, only a half-done approximation, and it involved way too many tmux interpositions.

Screen: client -> per client view of P panes -> P of entire pool of T terminals in session.

Tmux: client -> 1 of S sessions -> 1 of W windows -> P panes

To get splitting at the top level I need one top-level tmux client and its associated top-level session managed by one server. It can actually be locked to one window for what I care about.

In each of its panes, I need to run another tmux client so as to be able to actually change what final pty each pane will display. And each of these need to be separate sessions, so that I can display separate things in each of them. Each of these separate sessions will generally only have one window, with its implicit single pane. I should, of course, run all of these inner tmux sessions as a separate server from the top one.

Now I just have to make restricted keybindings for both the top and inner clients, and make sure that any binding for the inner clients is replicated as "self-insert" in the top-level client. And remember that there are multiple places I can direct the command-lines (so need two bindings).

This results in the top level server having: client -> session -> 1 window -> P Panes -> P clients

And the bottom server having: P clients -> T sessions -> T windows -> T panes

And this doesn't yet let you have multiple top-level clients attached the way screen does.

I'm sure I could eventually sand off all the sharp edges where it doesn't do what I want and make this work.

Or I could just use screen.


well if you want an exact emulation of screen behaviour, sure, that may be difficult. but i think you are making this a bit too complicated.

if what matters is that different clients can look at different windows, then using multiple sessions with one window each will get you exactly that.

you may of course need to get used to tmux way of switching sessions (which is not different than tmux way of switching windows). however, i just checked, there is a command to switch to the next session, and you can bind that to a key. add another keybinding to create a new session, and that should get you to about 90% of your expected behaviour without nesting tmux.


That last 10% matters. I absolutely need split windows, and being able to switch what is in the sub-panes easily. That's what require nesting.

The fact that it also doesn't have my preferred behavior for multiple clients connecting is just a minor nit-pick at that point. And I did say that the misnamed 'switch-client' would likely work.

Tmux has some very nice features: a nice command language, xterm-style mouse support, including both event binding and sending to client (pass through only), well thought out client-server separation, the ability for other programs to fairly easily drive it, visual identification of panes, menu-popups, and the default status-line is nicer.

All of that is merely nice to have, not actually needed though. It fundamentally doesn't have a model that works well for me. I wish it did, or that screen gains such things.


> That last 10% matters. I absolutely need split windows, and being able to switch what is in the sub-panes easily. That's what require nesting.

That could be solved by not relying on tmux for splitting, but on a terminal that has that feature, one ssh+tmux connection by panel.

In any case, it seems that you have put a lot of thinking about this, and probably already considered this solution.

I've been in the situation of trying to make my workflow perfectly fit in my new software/os/laptop/etc and not being able to make it work 100%. It's... sometimes exhausting. Nowadays I take another approach: make it fit well enough.


one ssh+tmux connection by panel

well that would be like putting two terminals next to each other.

doing that inside the terminal instead has the advantage that it works on remote terminals too.

there is splitvt, but it doesn't seem to be actively maintained anymore


> well that would be like putting two terminals next to each other.

Pretty much! Not a perfect solution for sure.

I used to only rely on tmux, but nowadays I often prefer opening more terminal windows. I found out that by overly relying on tmux, I kept open way too many shells (browser tabs, anyone?).

Now for most tasks I open a terminal without tmux, that way I'm forced to close it if I don't want to have a cluttered desktop (I never minimize or hide windows, I don't even know how to do it on my wm).


oh, i see, i missed that. so you want each of the split windows behave like its own session or something, and switch terminals within that freely.

that does make sense and i can see that's a useful way to work actually.

so how does screen do that?


> so you want each of the split windows behave like its own session or something, and switch terminals within that freely.

Exactly. Or at least that's one way to describe it, though it might also equally describe other things.

> so how does screen do that?

It's screen's native model.

'C-a |' (split -v) splits side-by-side and 'C-a S' (split) splits top-and-bottom. Screen calls these "regions". 'C-a X' (remove) will remove the current region, letting the sibling take the entire space again. 'C-a Q' (only) will replace the entire layout with the current region. 'C-a tab' (focus) will jump to the next region (it can take directional arguments to move up, down, left, and right, as well as 'prev' to go the opposite way of 'next', but these are not bound by default).

What tmux users would normally think of as window changing commands just switch the current region in the layout between viewing the entire pool of running commands.


i seem to be missing something. tmux has all those too.

I never said it didn't. It's the way they combine and interact that differs.

> I think tmux's model is due to being too closely modeled on GUI virtual desktops.

Yes, using a terminal emulator that natively supports tmux control mode (like iTerm2) is really nice because you can easily resize/rearrange the windows and panes via GUI actions, but it absolutely sucks when you have to reattach using a traditional terminal emulator because you now have to adjust all those windows and panes with keyboard actions.


If you use a terminal emulator that supports pointer events, tmux can use those.

You generally just need to do:

set -g mouse on


> It handles the multiple attached clients by only make

... by only making the ptys that are visible in multiple clients have the same size.


Except that screen is dead :-/


Dead? 4.8 came out less than a year ago and there was a patch less than a month ago.

https://git.savannah.gnu.org/cgit/screen.git/


If you require SSH features (e.g. port forwarding) or want to continue using your terminal's native scroll functionality, here's another alternative that a friend and I devised:

https://mazzo.li/posts/autoscreen.html


This is great, thanks!


I hate mosh because it nukes scrolling. I hate tmux because I just can't remember the keybindings. I've been using mosh+byobu since the keybindings are a bit easier for me. But I'd love regular ssh to be resilient like mosh, or maybe mosh can start supporting scrolling.


I use mosh + screen, because you can tell screen to act like it has normal scrolling with a simple config and then the only keybinding you need to remember is disconnect.

    defscrollback 500000
    scrollback 500000
    termcapinfo xterm* ti@:te@
There you go, normal scrolling!


Thank you very much, I use screen rarely and the thought that screen can be configured has never crossed my brain.


Not much configuration, and slightly annoying that the config file is at $HOME/.screenrc instead of being patched to use an XDG location, but it's "good enough" for my use cases.


Doesn't seem to work for me. I use mosh to start the session, then screen to create a new screen. But this does not scroll using my mouse. I can scroll using Ctrl+A and then arrow keys. But that's not what I'd call normal scrolling.


Then your version of mosh is breaking on termcapinfo. I can't really help on your particular version, but that's not screen breaking. It has told mosh it supports scrolling with the mouse. Case of "works over here".


Thanks for the tip anyway. I'll have a place to start investigating the problem.


Mosh is designed to only synchronize the last screen of changes and will never support scrolling.

https://github.com/mobile-shell/mosh/issues/122#issuecomment...

http://web.mit.edu/keithw/www/Winstein-Balakrishnan-Mosh.pdf (Section 6)


I got around the `tmux` barrier by writing down my most useful commands on an index card, cutting it into the smallest rectangle I could (and folding it in half), and tucking it in the pocket of my phone's flip case (or you could stick it in between your smartphone and its back shell).

I'm pretty happy I got over that hurdle.


Mosh is great - it drastically improved my ability to work remotely over my DSL connection.

Adding byobu made it effectively impervious.


As others have pointed out, mosh doesn't support port forwarding, and because ssh doesn't support UDP forwarding, it can't easily be used with ProxyJump servers.

Story time: I had tmux sessions active on all our servers and would simply ssh into my work laptop from home, so the sessions were never really closed. One day, I decided to upgrade from 14.04 to 16.04 and closed out all my sessions (including on the servers, because I was pushing out a new tmux config). After 5 minutes we started getting SMSes that the system was down and couldn't write to disk. One of our production servers had been set up with an encrypted home folder, and when my session closed out, it closed the encrypted folder. Unfortunately, the ssh folder wasn't outside the encrypted portion, so we had to use IPMI to restore access. That's the story about how we started joking that closing my laptop is a great way to break the production system.


This kind of thing is why I have always been very sceptical towards encrypted home directories, and have always advocated for full disk encryption.


I use tmux when ssh-ing from an iPhone (which kills background apps after 10s), it works great for me.

I also use it when I know I'll need to finish something on a different machine.


Which ssh client do you use? Is it free software? And if not, how do you trust it not to steal your keys?


I use termius. They're quite reputable (and YC backed [1]).

You're right about the keys if I'd be syncing them. But for my specific use case I happen to use passwords for these servers.

[1] https://termius.com/about


It would be very unusual for an ISP to drop idle connections. This implies all your connections are going through a layer 4 router. More likely you have a stateful firewall in the path somewhere: home router, server ISP firewall, etc...

[Edit:] Or in this case, an ISP in Denmark that is trying to minimize IPv4 cost by using LSN (carrier-grade NAT), which also has many other drawbacks.

> A SSH session does not generate any traffic

This does not have to be true. You can enable TCP keepalive in the server and client configuration.

Client via ~/.ssh/config:

  TCPKeepAlive yes
  ServerAliveInterval 60
  ServerAliveCountMax 2
Server via /etc/ssh/sshd_config:

  TCPKeepAlive yes
  ClientAliveInterval 60
  ClientAliveCountMax 2
> Why are the TCP keepalives only sent after 2 hours?

Each OS has a default time set for keepalives. If you do not specify it in the ssh config, it will use the OS default. In Linux, you can set this in /etc/sysctl.conf:

  net.ipv4.tcp_keepalive_time = 60
  net.ipv4.tcp_keepalive_intvl = 60
  net.ipv4.tcp_keepalive_probes = 2
After adding this, run sysctl -e -p

You can see the timers on your established connections with:

  ss -emoian | grep tim
Note: TCP timers are not the same as ssh client and sshd server tcp keepalive packets. These are two distinctly different mechanisms that can accomplish the same thing. Not all applications support TCP socket keepalive. You can wrap applications with a library called libkeepalived to add support without code changes by using LD_PRELOAD.
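Roughly along these lines (the exact library name and install path vary by distro, so treat this as a sketch):

    # force TCP keepalives on for a program that never sets SO_KEEPALIVE itself
    LD_PRELOAD=/usr/lib/libkeepalive.so nc example.com 9999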

In Windows this is set in the registry [1]. On Mac this is set via sysctl, similar to Linux.

After you have adjusted your client and server config, restarted sshd on the server, then ssh to your server using the flag -vv and you will see the keepalive packets.

[1] - https://serverfault.com/questions/735515/tcp-timeout-for-est...


Correction: TCP keepalives, ssh server keepalives, and ssh client keepalives are three distinct and independent mechanisms. You only need one.

I usually just do client keepalives as they are easiest to set up. Server keepalives are good if you are worried about “forgotten” clients. TCP keepalives are usually not worth it IMHO.


I also changed to using client keepalives after something in our office network changed: they installed new switches and access points and suddenly my ssh sessions wouldn't stay open. After getting nowhere with IT (mainly just a low priority issue to them) it was just less frustrating to enable keepalives and the problem disappeared, so that's my default config everywhere ever since.


I think TCP keepalives are conceptually the best though, as your problem occurs at the transport level, not the application layer.

This way you solve it where the issue occurs, and with the added benefit that it works for all TCP connections, not just SSH.

However I haven't had this issue. My ISP is pretty OK in this regard and I supply my own router. So I don't know if there are issues with this in real life.


Some NAT implementations ignore TCP keepalives. Alcatel branded ADSL modem/router I had used in 2005-ish certainly did and IIRC some more recent Zyxel ones do the same.


I describe those workarounds in my post as well. But that only solves the problem for me.

Making my ISP fix the underlying issue - that their TCP connection idle-timeout is too short - will make sure all their customers won't have to encounter this problem.


Edit: I missed the part that their network used LSN.


Please read the post. My ISP already confirmed the problem, and told me that they expect to roll out a fix this week. I live in Denmark, and here it is fairly common that ISPs do Carrier-grade NAT.


I think saying it's common in Denmark is probably overstating it a bit.

For wired connections I think it's only the small newish ISPs + Stofa that do CGN; the rest, like TDC and Telenor, provide IPv4 to the CPE equipment.

I have hiper, they do CGN by default but if customers ask for it they can get a dynamic IPv4 for free or a fixed one for a small fee.


What does their fix look like? I guess you can't change this limit for all connections otherwise they'd have to buy more IP addresses for their NAT routers, so maybe they only fix it for SSH connections, them being few?

I had the same problem and did the ~/.ssh/config trick years ago. Interested in contacting my ISP so that they fix the problem for all users (although it might be fixed now, idk).


They will increase their "TCP established connection idle-timeout" from 1 hour to 2 hours and 4 minutes as I requested.

This shouldn't make much difference for them. Most connections are closed within a few seconds anyways. Long lived connections with no traffic are rare.

With no data whatsoever, I'm guessing less that 1% increase in NAT table size.


Is there a list of Danish ISPs that do this?

I've had YouSee since I moved here, and I have a single public IPv4. I didn't realise that was not standard.


When I lived in Denmark, 3 would often use carrier-grade NAT, but not always. Based on talk with colleagues back then, it's quite common with mobile broadband.

Here in Finland, the situation is similar; when using mobile broadband you usually end up behind CGNAT.

Luckily, most ISPs will happily provide a static IPv4 for you for a small fee.


I missed that part. I would not have expected that in Denmark. LSN is awful. You will be sharing source port depletion limitations with others in your network. That also means you can't host any servers unless you use port forwarding services or reverse vpns like hamachi. It also means you are sharing a SNAT with others on your network which means that malicious traffic from others could be attributed to you. Glad they are fixing it for you. If they didn't, then one would hope there were other ISP options.

Any ISP using LSN will have low NAT timeouts because it takes memory on their routers to track sessions and state. I would be surprised if your ISP removed timeouts unless they are letting it fall back to FIFO pruning on your segment. Did they tell you what they are changing?


It sounds like he's paid his ISP for a (dedicated) public IP, so it should be 1:1 NAT, which doesn't really need connection tracking.

For the rest of the customers that don't pay extra for a public IP, all the crappy things you mention do apply.

Hopefully, the ISP does native IPv6?

And, while 60 minute timeouts violate the RFC, it's a whole lot better than I expected. Usually CGN timeouts are around 15 minutes for nice ones, and I've seen 10 seconds at the bottom end.

I wish the longer ones would probe both ends of the connection to see if it's still live a minute or so before they intend to kill it.


What you say sounds very dramatic, but the truth is that CGNAT is good enough for 99.9% of users.


That's bullshit. CGNAT is likely to cause all sorts of issues that average users aren't going to realize are caused by their "I"SP (a frequent one: being unable to host video game sessions). They aren't getting real Internet, and are being treated as second tier citizens.


Yeah, my ISP uses it. It does come with some of the downsides the previous poster mentioned: the inability to make myself reachable from $the_world can be annoying, and I get a captcha on Google every time because of "unusual traffic" (I mostly use DDG, but sometimes fall back to it). Also, ACM blocked me at some point because "my IP is infiltrated by SciHub" (their words).

In the end, it's an imperfect solution for a real problem that mostly works well enough.


ServerAliveInterval and/or ClientAliveInterval fix that behaviour just fine, for everyone, if they use it.


Do you know of some mechanism that makes ssh sessions survive a power-suspend (on a Linux desktop)?


That is pretty much an inverse problem ;)

If you care about that you probably should use mosh as that does solve that by design and not by random chance.

On the other hand using VPN with fixed tunneled endpoint IPs causes idle TCP connections going through it to remain connected pretty much indefinitely.


Just don’t use keep-alive feature. Without keep-alive traffic the peers have no way to tell the interface was transiently unavailable.


This is needed, in addition to the machine getting the same IP address back after resume (static assignment or long DHCP leases).


You should check out mosh.


The very first thing I do whenever I ssh,

  $ screen -x
If you don't do this or something similar, then perhaps you should start now.

https://www.gnu.org/software/screen/


I've always use `-D -R`. I just checked `man screen` and it actually states "This is the author's favorite."


That's my thing as well. Actually screen -rdU; the -U is from old times when weird text chat apps were used with my weird non-UTF locales.

Tried tmux a few times, but I'm not gonna re-learn everything every 15 years, come on! Also luckily when I switched jobs all admins used and installed screen everywhere, so that was an easy fit.


That doesn't really help when transferring a file. That was one of the use cases the author had problems with.


Yes it would because the file transfer would continue inside the screen session instead of dying with the ssh session.


I think they meant they were using scp to transfer files from the remote computer to the local one.


That wouldn't be an "Idle SSH Session".


Or keeping an SSH session open for proxy tunneling when PuTTY is only capable of automatically reconnecting on certain disconnects and not others for some stupid reason.


rsync over ssh solves this... Just start it again and transfer the remaining part.


"I documented my findings, and sent an email to my ISP. I quickly got a response back acknowledging that this is a bug on their side, and thanking me for my research."

I am actually shocked that: a) the ISP has an email address or any sort of asynchronous communication (most in the US have at best a "talk to a bot" functionality); b) they acknowledged the behavior and did not simply respond "have you tried rebooting the router?"


Same issue from one client location. Didn't bother with debugging, but switching the connection to go over a WireGuard VPN resolved it. Just 2c in case someone has a similar issue.

EDIT: Also see Mosh (https://mosh.org/)


Just a little FWIW, I am with Hyperoptic in the UK (pretty common if you live in an apartment in a big city centre). They provide nice fast, cheap, fibre broadband (1Gbit/sec in both directions). I started allowing remote connections on my Plex (media server) but couldn't get it to work, which was puzzling as I have been setting up port forwarding for 25 years, so I thought I had run out of unsolved mysteries. Then I read about CGNAT. Then I found out Hyperoptic uses CGNAT. I felt like I had been swindled: an internet connection with ample bandwidth for hosting services, but no way to have any incoming ports (no matter what port forwarding I set up on my router, the ISP's CGNAT router doesn't let me set any port forwarding up as it is out of my control).

I spoke to Hyperoptic, they said I could have a fixed (non CGNAT) IP for five months for free and then £5 per month thereafter. After five months I noticed they had started charging me £1.25 per month. I am not sure if that will increase to £5 at some point, but either way, I am happy to pay to have a "proper" internet connection.

I share this as perhaps others will be in similar situations but not realize some ISPs will let you escape the CGNAT. They switched me after a five minute chat and within 2 hours (considerably less but they said up to 2 hours) I restarted my router and I was good to go.


My ISP in Germany charges 5€/month for a non-CGN IPv4, but static addresses are unobtainium outside expensive business contracts. At least most ISPs have good IPv6 support nowadays, even on mobile.


Thanks for this, as I am thinking about switching to Hyperoptic, and am interested in self hosting several services.


Hyperoptic is amazing. I cannot recommend them enough. The only annoying thing about them is that not even a dynamic IPv4 is included in the price.


Maybe they could offer you IPv6 cheaper?


IPv6. Push for it.


I'm really looking forward to an IPv6 future with no NAT. My hope is it will empower people to easily host websites from home, make peer-to-peer connections, and generally own their stack.

Example: you want to access a home camera (or other IOT device) from your phone. Right now I don't see how to build this as FOSS without any third party or privacy concerns. With static IPv6 addresses it should be pretty easy.


Even with IPv6, almost all consumer routers are configured to deny all inbound connections by default, which is a huge damper for getting the average Joe to adopt peer-to-peer software.


I agree!

I'm on Comcast in California, and I found that they're providing IPv6 (no CGNAT that I can see) through to my (personally-owned) router (an Asus RT-AC68U). So all my systems at home are getting an IPv6 (or multiple) using the /64 dynamically allocated by my ISP.

And today I just discovered that my parents, who get service from Cincinnati Bell FTTH, are also getting IPv6! They're using an ISP-provided router, and everything is just working.

I am really happy that things are rolling out, albeit slowly.


For all of the issues with Comcast, the one bright spot with them is their IPv6 support. They were one of the first to support it well.


Ah, I didn't know that! But it makes sense; their network is certainly large enough.


Well, duh, there's no NAT in IPv6...

A dynamic /64 is still not proper Internet though.

As a residential customer, that would be a static /56 at least:

https://www.ripe.net/publications/docs/ripe-690

> /64 is not sustainable, it doesn't allow customer subnetting, and it doesn't follow IETF recommendations of “at least” multiple /64s per customer.

(Why are ISPs being skimpy on IPv6 addresses?? Doesn't this imply that they will need to do extra work in the future to move those /64 customers to /56 or /48 ?)

> An alternative is to reserve a /48 for residential customers, but actually assign them just the first /56. If subsequently required, they can then be upgraded to the required prefix size without the need to renumber, or the spare prefixes can be used for new customers if it is not possible to obtain a new allocation from your RIR (which should not happen according to current IPv6 policies).


Can you get static IPv6 addresses assigned? Or do they change periodically?


This might hurt privacy a bit as now each device in the house is uniquely tracked (already possible through other fingerprinting, but with this much moreso).


That's what IPv6 privacy extensions are for. The first RFC specifying that is from 2001 and it has been available in most operating systems for a long time now, although it was buggy for a while in windows.


Indeed. There's no point in doing all of what the OP did.

IPv6 was finalized in 2017.

In 2020 Europe ran out of IPv4 addresses, and many Asian countries never had enough of them to start with (so quite a bit of people are effectively IPv6-only already).

An "I"SP that doesn't provide a /48 or /56 IPv6, shouldn't be legally allowed to advertise that they are providing "Internet" (and technically/historically, they're actually providing ARPANET, IPv4 having been supposed to be only a temporary, experimental version.).

And just like it was done for obsolete TV technologies, laws should be put in place first outlawing hardware that isn't compatible with IPv6, then later hardware compatible with IPv4.


Is there any tips on how to make IPv6 easier to use for typical day-to-day network administration? One advantage of using IPv4 is addresses are easier to memorize, so when you're building a network, you can keep track of everything in your head. I think this might be a major reason people dread setting up an IPv6 network, at least for me.


Usually your ISP delegates a /56 or more to you.

From what I've seen, it looks like /64 are thought of as a vlan, within which clients can perform SLAAC.

For static IPs, I usually concatenate /56 + :id: + :suffix:.

Like: home computers on /56 + ::1 + SLAAC. Most OSes will dynamically change their IPs for privacy reasons.

My servers are on /56 + ::0 + :100,101,102, etc. I generally pick these suffixes to match with the IPv4 addresses, but you can allocate one per service, and get rid of reverse proxies (easier migration, you can just move the service to a new machine).

So, to take a specific example, 2a01:cb14:d6e:2000/56 is my ISP prefix, which can be thought of as the external IP, and 2a01:cb14:d6e:2000::11 is my server. 2a01:cb14:d6e:2001::/64 could be computers. I don't always follow the above scheme, IPv6 is big enough to get away with a lot of things, but it helps having something to default to.

My point is: you don't have to remember the prefix anyway, since every computer in the network will share it. Now, if you need static, easy to remember IPs instead of SLAAC, use static IPs or DHCPv6, or even better, mDNS to resolve .local addresses to IPs.

Looking at the above, this assumes a certain level of trust on the local network, which is fine at home or within a network dedicated to servers, but might not be at a company? mDNS can lie, someone else might advertise the same IP. These problems are not exclusive to IPv6, but they are a product of the era. Nowadays, I wish we just used crypto key routing (like yggdrasil does, and maybe TOR) on a planetwide mesh network, but we'll need IPv6 in the meantime :)


:: shortcuts or domain names. Also, AFAIK you're not supposed to be using fixed IPv6 suffixes, for security reasons.

(Some people even advocate that consumer router IPv6 firewalls should be opt-in – which millions of them still are – and as you can guess with how opt-in works with consumers, the overwhelming majority of them therefore use IPv6 without a firewall.)


Run a DNS server linked to your DHCP. I have a Pi-Hole set up which maps every device on the network to <hostname>.mydomain.uk.

If I don't like the hostname (some IoT devices don't allow changing it) I can map a different name to that MAC address.

(I still use v4 and have no need to remember more than 2 IPs. Should make migration to v6 much easier.)


I find v6 addresses no more difficult to remember than v4. It's a question of what you're used to and what you practice, both of which take time and exposure.

I can tell you the prefixes on my home ADSL connection, but not necessarily the ipv4 subnet, just because I work with the V6 addresses so much more often.


Make them memorable then - Facebook do:

    $ host -6 www.facebook.com
    www.facebook.com is an alias for star-mini.c10r.facebook.com.
    star-mini.c10r.facebook.com has IPv6 address 2a03:2880:f158:82:face:b00c:0:25de


I see an IPv6 address on a computer screen. I want to connect to it from another computer. Copy pasting doesn't work between computers. Am I supposed to type it letter after letter? Am I supposed to send somehow? What if I don't have internet access on the first computer? Do I need to go find a USB stick so I can transfer a file with the IPv6 address? Am I supposed to set up a DNS server of some kind?

I'll keep my IPv4 thank you very much.


Do you seriously ssh to raw IPv4 addresses? Everywhere I ssh to (including my home server) has a DNS address and that's how I connect to it.


Whenever I have got a new Raspberry Pi or PinePhone and connect it to the home network, I always SSH to the raw IP first. (Sure, at some point I’ll configure the router’s DHCP settings to ensure the new device gets a stable IP address, and then I can just use an SSH alias.) I would imagine that this is a very common use case.


I got a new NAS last week and only ever sshed to [hostname].local. Didn't have to configure my router at all. Come to think of it I don't even know (or care) whether I was SSHing via IPv4 or IPv6.


Any decent home router will automatically add the hostname of all DHCP(v6)-configured devices to its DNS service. You shouldn't need raw addresses at all.


I have not seen any routers that do it; my Ubiquiti AmpliFi router certainly does not do that.

That's very strange given that every cheapo ISP router I've had does that. How did you confirm that it doesn't?

I don’t know about him, but I do, quite frequently.


I use https://dns.he.net, with an hourly cron job that runs 'curl' to keep the address updated.


I think this is a legitimate downside of IPv6, but a small one in the big scheme of things. Considering in a world of IPv4, most people don't get to have an IP address at all...


Zeroconf for local networks, free dyndns for public addresses.


Isn't this a bit dramatic? It sounds more like an excuse than a legit reason to stay off v6. It happens extremely rarely, and there's only 4x the number of bits in an IPv6 address, so it's not an insurmountable task, and nothing prevents you from concurrently using RFC 1918 v4 addresses on a home network.

If this is a concern you can assign easy to type addresses such as 2001:1868:a106:101::120


When is the last time you had to type an IPv4 address that you couldn’t copy paste?


You could use base85


This + VPN idle killing sessions has forced me to get much better at tmux. I highly recommend for others who use SSH often, especially with WFH, to give it a try. The pane structure has made me much more efficient at my work nowadays.


When my company started, we used tmux in PROD as a process supervisor. We had tons of lines like:

    while true; do ./runserver; sleep 10; done
in a shell script that would start a new server with one window per process. It got called from /etc/rc.local on boot IIRC.

Deploys meant pulling the new code on the server (or maybe just editing it in vim right there), then just Ctrl-C in every terminal (or when I got lazier, `killall runserver`).

This was back in 2010 or so... I have professionally come a long way in the intervening decade, and no longer have any PROD services running in tmux panes, but I definitely learned to love that tool.


I've done similar things and do it for quick-and-dirty jobs. Currently I have something like that on an RPi, but instead of

    while true;
I have,

    while [[ ! -f "$DIRECTORY"/backup_exit.stop ]]; do
        sleep 3600 # sleep an hour
        time /home/ubuntu/.pyenv/versions/backup/bin/python main.py
    done
Whenever I want the process to stop, `touch backup_exit.stop`. This waits until the run finishes and exits on the next loop.

Anyway, that is saved to a 'backup.sh'

Then I have also,

    #!/bin/bash
    session="test"
    backup="/path/to/backup.sh" 
    
    #create detached session named test
    tmux new-session -d -s ${session}
    # Create windows
    tmux rename-window -t :0 'backup' #rename the first one
    tmux new-window -n 'htop'
    # Run processes
    tmux send-keys -t 'backup' "$backup" ENTER
    tmux send-keys -t 'htop' 'htop' ENTER
    
    
And this is run automatically on @reboot via cron.


This is super interesting! Wouldn’t nohup do the same thing though?


+1 to tmux, or even screen if you can't get tmux installed.

It also protects against flaky network connections and accidentally closing the local terminal emulator.


tmux is legitimately one of my most useful tools in day to day work.


One should set both ServerAliveInterval and ClientAliveInterval settings on ssh to a reasonably small time (e.g. 1 min or 5 min). Note that there's another setting {Server,Client}AliveCountMax which multiplies this to find the actual "connection is dead" determination time.

The tradeoff is between a network disconnect-reconnect killing your connection when it shouldn't (because you've noticed the disconnect but you wouldn't if you didn't send heartbeats), and discovering the network disconnect or dead peer (which you wouldn't if you didn't send anything, thinking it's still there).

Personally, I prefer to know the connection is dead sooner than later even if it comes back on its own.
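For the record, a minimal sketch of where these live (values are examples, not recommendations):

    # Client side, ~/.ssh/config: probe every 60s, give up after 3 missed probes
    Host *
        ServerAliveInterval 60
        ServerAliveCountMax 3
    # Server side, /etc/ssh/sshd_config: the mirror-image settings
    ClientAliveInterval 60
    ClientAliveCountMax 3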


To transfer a multi-hundred-GB file I'd use something restartable like rsync or wget instead of nc.
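Something along these lines survives a dropped connection, and re-running the same command picks up where it left off (sketch; host and paths are placeholders):

    # Resumable transfer over SSH, keeping partial data on interruption
    rsync --partial --inplace --progress -e ssh bigimage.qcow2 user@dest.example.com:/var/images/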


The issue in the article could have been solved by running a long-running command with no output in screen or tmux. If ssh disconnects, just reattach.


no, because this example was site->site data transfer across the disconnecting link


yes, because the issue was not the site data transfer, but the idle SSH session(s) running the nc commands. When the idle SSH session got dropped, it took nc down with it and that stopped the transfer.


no need for tmux, nohup would have done the job.


Just ping it.

Ping it.

Ping it

Ping it

All you got to do is ping it.

https://stackoverflow.com/questions/13628517/is-it-possible-...


Wireguard will help you work around this. Just tunnel ssh through it. You will still have nat failures but wg will reconnect and things will look stable to ssh.
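The relevant knob is WireGuard's built-in keepalive. A sketch of the client-side peer section (key, endpoint and address ranges are placeholders):

    [Peer]
    PublicKey = <server-public-key>
    Endpoint = vpn.example.com:51820
    AllowedIPs = 10.8.0.0/24
    # Send a keepalive every 25 seconds so the NAT/CGNAT mapping never sits idle
    PersistentKeepalive = 25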


I discovered this same issue with my ISP.

Solution: I’ve found that simply executing “top” generates enough activity to keep the session alive until you get back to it.


> This happened one day while I was transferring a VM image in the hundreds of gigabytes from one server to another using netcat like so: ...

> ... my SSH connections had died yet again. And sshd had taken netcat down with it, killing the transfer midway.

As somebody who has experienced the described behaviour, I can empathise because it is super annoying. But this isn't a particularly good example with which to illustrate it. If you're doing any sort of work on a remote server that shouldn't be accidentally terminated, like a large data transfer, you should be backgrounding it with screen or similar. Even for users without the idle-connection dropping described here, the Internet can still fail.


Running tmux or screen with top in one window should fix it.


What happens when it doesn't? I've messed with my ssh config and TCP keepalive settings. I've made sure to set my computer to not go to sleep when I have open network connections. No matter what I do, if I leave the terminal window open overnight and come back in the morning, I get a broken pipe and have to log back into my cluster, and reattach my tmux session. I've pretty much just given up on finding a solution.


Or just display the clock in the tmux status bar.


By default `<prefix> t` shows a clock in the current pane, really handsome looking and it scales nicely as the pane changes shape.


Has anyone implemented OpenSSH over QUIC yet? If ever there was a match made in heaven for pairing an asynchronous channels protocol encapsulated over TCP with an asynchronous channels protocol encapsulated over UDP, this seems like one.

EDIT: Yep, there's `draft-bider-ssh-quic-09` and a localhost proxycommand that offers both-sides OpenSSH integration:

https://datatracker.ietf.org/doc/draft-bider-ssh-quic/

https://github.com/moul/quicssh


> If ever there was a match made in heaven

I think SIP+RTP over QUIC might be a contender for that title. No more NAT traversal wackiness caused by needing both a TCP connection and a UDP connection. I still can't figure out why SIP didn't use TCP-over-UDP for the low-bandwidth control packets.

QUIC has protocol-level support for running both a reliable (TCP-like) and datagram (UDP-like) substream over the same QUIC connection. Cannot wait for SIP-over-QUIC!


When SIP first came out, UDP wasn’t safe through NAT like it is now. Lots of cheap home routers would just wreck it. It’s gotten better since then.


It's really a problem that many ISPs are still lagging on the transition to IPv6.


Might be because, while IPv6 rollout to residential customers started more than a decade ago, including governments making a lot of noise about it back then (and a few years later too), IPv6 wasn't really finalized until 2017?


It's not surprising. Try dealing with consumer routers and their varying degrees of support for dynamic assignment of IPv6 addresses. Seriously, it's a support nightmare.


Many ISPs ship their own routers to consumers. So they are mostly in charge of the tech stack. The fraction of users who use their own router is small.


Not necessarily. There are plenty of ISPs that have to go through wholesale networks where the underlying carrier has control over the modem and how provisioning works. This is particularly painful with cable modems.


Surprisingly, the problem is often their backend support for it, not so much the end user issues.


It's not surprising because IPv6 completely ignored transition planning:

https://cr.yp.to/djbdns/ipv6mess.html

Twenty years later, it still hasn't displaced IPv4. IPv6 was designed by a bunch of telephone-company guys who were used to Ma Bell being able to declare a flag day. That doesn't scale up to the Internet, and we're all suffering for their failure to understand backward compatibility.


This. When I (as an ISP) was looking into the consumer router situation 10 years ago, the protocols required to handle dynamic IPv6 assignment to end users in conjunction with PPPoE simply didn't exist. Specifically, without NAT, the ISP has to somehow delegate an IPv6 subnet to the customer for their local network. The mix of IPV6CP and DHCPv6 simply weren't able to do that, and if you've ever had to deal with this for more than a handful of customers, you know that getting $joerandomenduser to manually configure an IPv6 subnet just isn't going to work. Sure, if you can control what router every customer uses you can come up with a procedure for that, but going through the wholesale network of an incumbent means you don't have that luxury.


I haven't read TFA, but the same thing happened to me on an ISP with CGNAT, probably as a way of garbage-collecting open connections. The solution was to use an application-level keepalive, which WinSCP and PuTTY support.


*workaround.

The solution is for the ISP to fix their misconfigured NAT.


Taking into account they had to take extra steps to enable this "feature", they probably don't consider it to be "misconfigured", at least from their point of view.


A NAT is going to have to have a timeout, otherwise it will gradually leak and run out of ports. All protocols that operate behind NAT must implement keepalive.

The solution is IPv6. Then you don't need your ISP to maintain a stateful connection table.


I'll be pedantic: you mean that the solution is no NAT, with IPv6 being something needed to get there. Nothing stops you from NAT'ing IPv6.


> Nothing stops you from NAT'ing IPv6.

Nothing stops you from filling your car with orange juice either.


NAT'ing IPv6 works and is merely not a necessity. It could still have a purpose; it just won't be address exhaustion.

Filling your car with orange juice presumably stops it from working and is likely to cause damage, all while your parents question where things went wrong.


As proven by Microsoft in Azure's IPv6 support.

The solution is IPv6.


There is no such thing as a properly configured NAT implementation that does not have timeouts for idle sessions. Without those you’d run out of memory on your router and new sessions would be blocked.


NAT is fundamentally not the solution.


The app-level keepalive will work for a while, probably a long while, but could still fail if there are enough connections using the method and the CGNAT router has too few source addresses to map things to. If the router needs to find some ports to use for new connections, and there are no apparently idle connections to throw out, it has few choices:

1. Just stop making new connections until some ports are freed. That'll make people happy...

2. Kill the connections that have least recently seen activity, even if they have sent/received packets within the usual timeout.

3. Kill the longest-running connections that aren't to a whitelist of target ports like 80 & 443 (P2P and VPN systems will just reconnect, so the user will hopefully see no more than a short blip; SSH will not fare so well).


Interesting. I hadn't really been familiar with carrier-level NAT. How does that work with, say, a home server that other people connect to using only an IP address and a port number?


It doesn't.


So one day I could wake up to find it's impossible to host a Minecraft server at home, with nothing I can do about it?


Perhaps - depends on the growth rate of the ISP and how many IPv4 addresses they already have (and how much money they are willing to spend to acquire more).

But the real fix is to push your ISP to deploy IPv6. No need for the ISP to run carrier-grade NAT and you can host as many services at home as you want.


Yeah, several of the people I played various games with over the years had this issue.

I even had it too when I used cell Internet, but cell ISPs have a better excuse.

We generally found someone else able to host, except for some 1vs1 situations.


I'm working on a product that requires keeping the same TCP connection open for a long time, and my company found the same thing:

After a lot of resources invested in debugging, switching out hardware, and interminable log files, we suspected the ISP was to blame. So we built an MVP connected directly to what the ISP provides, with a client on a different ISP that wasn't closing connections. We saw the connection close at exactly the same time as before.


Wait, this person is in Denmark and can't change ISP?

If this happened to me then I'd change ISP. NAT on your home Internet? What is this, the US?

I thought if you were in Europe you'd pretty much always be able to get public IPv4, IPv6, and IPv6-PD. I know I can.


It's possible that he insists on IPv4 for some reason, though IMHO this is a bad idea and he should try everything to move to IPv6.

As you might guess, the ISPs that are forced to use CGNAT due to the lack of IPv4 addresses are also the ones who tend to be the first on IPv6, for obvious reasons.

I know of a big ISP, a relative latecomer, infamous due to its CGNAT issues, who last summer boasted reaching 99% IPv6 coverage.

(And at the same time, for some reason, they had to be threatened by the government that they would otherwise not be allowed to use 5G before they added opt-in IPv6 on cellular, which they did at the last moment, last month...)


CGNAT is standard for all smaller ISPs in Denmark. I'm not sure if it's mainly to stay competitive or due to limited supply, but they offer public IPv4 addresses upon request, sometimes for free. It makes sense really; if you default to CGNAT, which is fine for 99% of users, there will be more public addresses for the people who need them which keeps costs down for both segments. It's not the ISP's fault the standard is outdated.


I take issue with the claim that it's "fine for 99% of users" or that it's not a big deal:

https://news.ycombinator.com/item?id=25744675


That is not a realistic scenario. Firstly, there's no waking up to find your server is unresponsive. It's part of the deal when you sign up. Secondly, public IPs are always an option. Thirdly, the ISPs I have used have always had relevant FAQ/help articles.

If any of those points do not apply to your ISP, that's not because of CGNAT, it's just a shit company. I've used four different ISPs in the past four years and getting rid of CGNAT has not been a problem once. IIRC only one of them used public addresses by default, two offered free dynamic IPs upon request and my current ISP offers paid static IPs for $3.

Tons of my friends and family have had CGNAT and never known. It's just not a big deal for most content consumers.


He can but instead he contacted his ISP and got it fixed, which is much better. I don't know which ISP he is using but sounds like it is on the awful (by Danish standards) TDC network. Good thing there's great fiber almost everywhere.


TDC does not use CGNAT by default; they give out dynamic IPv4 addresses.


You should try tmux over ssh over mosh. Your session lives on thanks to tmux. And mosh reconnects you automatically, even when your IP address changes ;-)


The good ol’ days, when you had to keep something sending a steady stream of packets across the ssh session to keep the connection from dropping.


It's worth mentioning that if you have a large job that needs to finish, you should disown the process or use setsid:

    setsid somecommand --blah &
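nohup or the shell's disown builtin get you much the same effect (sketch; the command and log path are placeholders):

    # Detach from the controlling terminal so a dropped SSH session doesn't HUP the job
    nohup somecommand --blah > job.log 2>&1 &
    # Or, for a job already running in the foreground: Ctrl-Z, then
    bg; disown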


Thanks for sharing this. I’d always wondered how NAT works and this explanation worked well for me. Nice bit of investigation too!


For this reason, I use autossh.

I recommend it.
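A typical invocation looks something like this (sketch; host, ports and option values are only illustrative):

    # -M 0 disables autossh's extra monitor port and relies on SSH's own keepalives instead
    autossh -M 0 \
        -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" \
        -o "ExitOnForwardFailure yes" \
        -N -L 8080:localhost:8080 user@example.com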


As do I.

Every time I need to set it up, I go through this series of very short and digestible articles.

https://news.ycombinator.com/item?id=10937277 (discussion for the third in a 4-part series).

Never thought to look for an HN discussion on it, but it turns out that there is one.

IIRC, one thing the series doesn't mention is that a particular option needs to be explicitly specified in order to maintain port tunnels. By default, as long as the SSH connection itself succeeds during an AutoSSH reconnect, it will chug along happily without the port forward if the port is still held by the previous, dropped connection.


IPv6 is pretty nice. Once it's working natively on both sides that is.


Same for me. To avoid the problem I'm using tmux inside the SSH session.
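A handy one-liner for that, which creates the session if it doesn't exist and re-attaches otherwise (the session name is arbitrary):

    tmux new-session -A -s main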



Frankly I've never seen any connection that didn't use keepalives stay up for very long on my LAN, much less the Internet at large.

Keepalives are a basic requirement for persistent connections. It's literally what they're for. Harmless enough solution.


That’s a bit strange though, don’t you think? TCP state only lives in the endpoints, unless you have something awful like NAT in between. Without NAT, why are keepalives a basic requirement for persistent connections?


Without timeouts, one of the endpoints would have to maintain dead sessions indefinitely. One cannot rely on protocols to close connections properly - stuff happens.


I run under NAT, like most Internet users in the US, although my gateway has a static IP.

No clue why the LAN drops connections without keepalive traffic. I need to get up to speed on using WireShark to diagnose dropped connections one of these days, as I actually have a couple of dropped-connection issues that need troubleshooting.

Typically I'll get a connection timeout every few days, even with keepalives, and even with a hardwired DHCP address for the MAC in question. Connections with no traffic tend to get 'cleaned up' by somebody after a few hours at most.


Back before I knew about the SSH KeepAlive setting, I had SSH connections stay up for an entire workday (~8 hours or so) on my LAN to a local server.


>"A SSH session does not generate any traffic, unless there’s new output or input. The same is true for TCP. That is why,

after the TCP and SSH sessions have been established, no more packages are sent for a long time

."


My wireless hub is doing that, not my ISP.


Try using Mosh instead.


[flagged]


Of all the shitty things 2020 will be known for, this stupid calling everyone Karen/Ken is the absolute dumbest. Please, for the sake of our species, stop.


Just a repeat of the "Kevin" a few years later. I'm afraid this is never going to stop, though we have to fight it.


Sounds like you both could use a lesson in interpersonal conduct



