
Service provider story about tracking down TCP RSTs - cnst
https://mailman.nanog.org/pipermail/nanog/2018-September/096871.html
======
sleepydog
I've never heard of routers using the ttl field for hashing, why would that
ever be useful?

Also, it seems like an incorrect use of anycast to terminate the same flow at
different machines. At Google, anycast traffic goes through clusters of
Maglevs that direct individual flows to the same endpoint consistently.

[https://ai.google/research/pubs/pub44824](https://ai.google/research/pubs/pub44824)

Also I'd like to echo that Fastly's NOC is great. Super responsive and smart.

Disclaimer: I work at Google

~~~
mcpherrinm
If you're doing multi-tier ECMP with the same hashing algorithm at each tier,
using TTL can ensure you don't get polarization issues. Though there are plenty
of other ways to avoid polarization, this one works out of the box.

I have heard fastly competitors tut-tut their use of anycast as potentially
unreliable in the face of issues like this.

~~~
bogomipz
What is "polarization" here? I am not familiar with this term as applied to
routing. Oscillating between different available paths?

~~~
mcpherrinm
Consider a case where you have a router balancing over four links. It chooses
which link based on a hash of some information from the packet.

Now imagine you have four more routers on each of those links, hashing out
over four more links. So you have a tree with 16 outputs.

If the second-tier router uses the same hash algorithm as the first one, all
the packets it receives will hash to the same link, because it's doing the
same calculation as the router before it.

Thus the 2nd tier of four routers will only use 4 of their outputs, instead of
all 16.
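The effect is easy to reproduce with a toy hash. In this sketch (the hash function, addresses, and link counts are all made up for illustration, not any vendor's actual algorithm), the second tier collapses onto one link, and a per-router seed restores the spread:

```python
# Toy ECMP demo: when every tier uses the same hash function, a
# second-tier router only ever sees packets that already hashed to one
# value, so it reuses a single one of its own links ("polarization").
import hashlib

def ecmp_hash(flow, n_links, seed=0):
    """Pick one of n_links for a (src, dst, sport, dport) flow tuple."""
    data = repr((seed, flow)).encode()
    return int.from_bytes(hashlib.md5(data).digest()[:4], "big") % n_links

flows = [("10.0.0.%d" % i, "192.0.2.1", 10000 + i, 443) for i in range(1000)]

# Tier 1 splits flows over 4 links; collect the flows landing on link 0.
tier1_link0 = [f for f in flows if ecmp_hash(f, 4) == 0]

# A tier-2 router on link 0, using the same hash and link count:
used = {ecmp_hash(f, 4) for f in tier1_link0}
print(used)  # {0} -- all its traffic piles onto one of its 4 links

# A per-router seed (or folding in a hop-varying field like TTL)
# restores the spread:
used_seeded = {ecmp_hash(f, 4, seed=1) for f in tier1_link0}
print(len(used_seeded))  # 4 -- all links used again
```

The seed here stands in for any per-device variation of the hash; including the TTL works similarly because the TTL decrements at each hop, changing the hash input per tier.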

~~~
bogomipz
I think you are describing a Clos (spine-and-leaf) data center network
topology here.

>"Now imagine you have four more routers on each of those links, hashing out
over four more links. So you have a tree with 16 outputs."

I'm not really understanding this, as a link exists between exactly two
routers. Do you mean "path" instead?

>"If the second-tier router uses the same hash algorithm as the first one, all
the packets it receives will hash to the same link, because it's doing the
same calculation as the router before it."

And this is kind of the tax of flow preservation, which is fine compared to
the price of TCP reordering, no? Efficient hash-based ECMP utilization is
going to be a function of the distribution of source IP and port in the
5-tuple used in hashing. You can see this outside of ECMP, for example when
running LVS with the hashing algo and you have customers that are all behind
the same NAT box. But also there's nothing stopping you from using different
hashing on your spine tier than you do on your leaf tier. You could hash on a
4-tuple on one and a 5-tuple on the other.

At any rate, a common Clos ECMP design with BGP is to put each ToR switch in
its own ASN and then load balance across ASNs. So, using your example with a
3-tier Clos network: if tier 2 had the 16 outputs, then a router should ECMP
over the 16 different ToR ASNs to the destination.
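The NAT point above can be illustrated with a toy hash (the hash function and addresses are made up for illustration): clients behind one NAT share a source IP, so a hash that ignores ports collapses them onto one link, while a 5-tuple hash that includes ports spreads them out:

```python
# Sketch: flow-hash link utilization depends on the entropy of the
# fields you hash. NATted clients share a public source IP, so a
# (src, dst) 2-tuple hash sends them all down one link; a 5-tuple
# hash including ports restores the spread.
import hashlib

def pick_link(fields, n_links=4):
    h = hashlib.sha1(repr(fields).encode()).digest()
    return int.from_bytes(h[:4], "big") % n_links

# 500 clients behind one NAT: one public source IP, distinct source ports.
flows = [("203.0.113.7", 40000 + p, "192.0.2.1", 443, 6) for p in range(500)]

two_tuple = {pick_link((src, dst)) for (src, sp, dst, dp, proto) in flows}
five_tuple = {pick_link(f) for f in flows}
print(len(two_tuple), len(five_tuple))  # 1 4
```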

~~~
sleepydog
> And this is kind of the tax of flow based preservation which is fine
> compared to the price of TCP reordering no? Efficient hash-based ECMP
> utilization is going to be a function of the distribution of source IP and
> port in the 5 tuple used in hashing.

mcpherrinm's example is worse than what I think you're suggesting. Because a
given second-tier router will _only_ receive packets that hash to a specific
set of values, if that router has the same number of downstream links that the
first-tier router had, it will send _all_ packets it receives to the same
link, ignoring the other links. Which is a terrible trade-off for avoiding OOO
packets.

The more reasonable trade-off is giving up on utilizing all routers for a
_single_ flow, to avoid OOO packets.

~~~
bogomipz
>"The more reasonable trade-off is giving up on utilizing all routers for a
single flow, to avoid OOO packets."

Huh? No, you would never want to use "all" routers for a single flow anyway.
Each router just needs to make a deterministic selection for each packet. The
alternative to a hash-based scheme would be per-packet load balancing, which
is practically never used b/c it gives you TCP packet reordering.

>"if that router has the same number of downstream links that the first tier
router had, it will send all packets it receives to the same link, ignoring
other links. Which is a terrible trade-off for avoiding OOO packets"

No, it would not be a terrible trade-off. Optimizing for maximum link
utilization only matters if you have congestion in your network, and even then
ECMP is congestion-agnostic. In reality your leaf network has fewer downstream
links than it has upstream. A common topology is 4x4x2, where each leaf node
has two downstream links to two ToR switches.

------
Animats
A consistent route for every packet in a flow is considered desirable, but
it's not an entitlement or a requirement of the IP protocol. A DDoS
prevention tool that insists that the TTL be the same for every packet is
broken. Here's one that checks for inconsistent TTLs, but it has some
tolerance for variation.[1] The one mentioned in the original post didn't like
a difference of 1 in TTL.

[1] [https://fortiguard.com/encyclopedia/ips/12934/tcp-ttl-evasio...](https://fortiguard.com/encyclopedia/ips/12934/tcp-ttl-evasion)

~~~
_wmd
I'm not a networker, but from a correctness standpoint this sounds like a
problem on Fastly's end -- they're reusing frontend IPs for distinct sets of
machines, and traffic directed to the 'wrong' PoP is dropped hard rather than
attempting any kind of internal routing.

Of course that kind of routing would create a potential bottleneck for an
attacker to exploit ("simply" force traffic for those IPs to the wrong PoP,
assuming $attacker had this level of access to the backbone), but that's the
problem Fastly are supposedly paid to deal with.

Their scheme is fine and dandy with a protocol like DNS, where UDP retries are
transparent and TCP is a tiny fraction of weird traffic, but for business
applications handling credit cards, surely the occasional RST is already too
many.

Or, to look at it another way, they're basically saying their IP addresses are
special snowflakes, that the full address actually includes the route, and
that source networks are wrong for assuming things work the way they're
supposed to everywhere else on the Internet.

~~~
windowsworkstoo
Not really. Anycast is fairly standard and usable for stateful connections --
the issue is, again, middleboxes fucking with stuff, plus a weird default of
incorporating TTL in the ECMP hashing algo.

~~~
toast0
Expecting all packets on a flow to be delivered via the same path is extremely
optimistic.

------
jdwithit
We run Arista 7280SR's on our edge. I went looking, and as far as I can tell,
the TTL is NOT part of the default ECMP algorithm on this platform (Jericho
chipset)? We certainly haven't tuned this setting one way or another. I would
love for someone more experienced with Arista kit to weigh in on this, since
it seems like it may be platform dependent. We do have a support contract so I
can reach out to them directly, too.

These commands are the best I could find browsing the Arista docs/blog/forums.

edit: fixed code formatting

edit2: We're on the EOS 4.20.x version train, fwiw

    
    
      #show port-channel load-balance jericho fields | grep TTL
      IP TTL field hashing is OFF
    
      #show load-balance profile (output snipped for brevity)
      ---------- default (global) ----------
      IPv4 hash fields:
         Source IPv4 Address is ON
         Protocol is ON
         Time-To-Live is OFF
         Destination IPv4 Address is ON

~~~
frnkblk
The Arista units are 7504N, which are running 4.20 or later, and I believe
they use the Arad or Petra chipset.

------
mirimir
I wonder how such anycast setups would deal with MPTCP.

~~~
toast0
I don't think MPTCP is going to interact very well with modern high-load
sites. Even using unicast, it's going to be hard (impossible?) to ensure that
all the individual subflows make it to the same NIC queue, which is what you'd
really want for performance. I suspect this is part of the reason why it
hasn't caught on very much. (Also, Google doesn't seem to want to invest in a
good TCP stack on Android; instead they put an additional layer on top of TCP
(HTTP/2) and then built a TCP-like transport on top of UDP (QUIC).)

~~~
mirimir
Sad. Apple seemed interested, at one point anyway.

But damn, it does work amazingly well site-to-site. I've managed ~50 Mbps
throughput using bbcp (four streams) between Tor .onion services via OnionCat.
And ~190 Mbps total from one source transferring simultaneously to five target
servers. Each peer had six .onion services.

With six OnionCat interfaces per peer, in MPTCP full-mesh mode, there are up
to ~36 subflows per TCP connection. So using bbcp with four streams, there are
as many as ~150 tcp6 connections via Tor per bbcp transfer. And with five
simultaneous transfers, the MPTCP kernel in the source VPS was managing up to
~750 tcp6 connections. That's impressive!

[https://ipfs.io/ipfs/QmUDV2KHrAgs84oUc7z9zQmZ3whx1NB6YDPv8ZR...](https://ipfs.io/ipfs/QmUDV2KHrAgs84oUc7z9zQmZ3whx1NB6YDPv8ZRuf4dutN/)

~~~
toast0
Apple is still pushing it, which is great. I'm just not sure how well it's
going to scale -- MPTCP adds an extra layer of indirection, and extra locking,
even in the easy case where it's just one server handling an IP. In the load
balancing case, people are going to have to teach their load balancers a lot
of new tricks to get the subflows aligned. If using anycast, the client is
likely to be using multiple networks, so reusing the same server address is
likely to land at a different PoP; exposing a PoP-specific extra server IP
seems like something people don't want to do, since exposing that may make it
easier to attack a single PoP.

I've been debugging an issue where incidentally I'm hitting 1gbps on a single
tcp connection (server to server, with tls), so I'm not sure why MPTCP is
required? ;) But I guess if we had it, I would probably hit 2gbps instead of
being capped by the one nic.

~~~
mirimir
Where it gets useful at consumer level is when your phone can hit WiFi and 4G
simultaneously, so you can aggregate. And it's even more useful when both WiFi
and 4G are iffy, so you seamlessly use one, the other, or both.

And yes, if both of your servers have two gigabit NICs, you can get 2 Gbps.
But only if those uplinks aren't bottlenecked at 1 Gbps at the rack or data
center level.

------
CKN23-ARIN
TL;DR: Incapsula, Imperva, Sucuri, Fastly, and likely others are using anycast
incorrectly.

~~~
windowsworkstoo
Not really...The issue is more using TTL in your ECMP hash algo. It’s a weird
default to have.

~~~
CKN23-ARIN
That's what caused the issue to surface, but the root cause is offering a
stateful service over anycast. There is no hard requirement that you ECMP
flows consistently. Spraying may be sub-optimal, but it must be accepted.

I do agree that including TTL by default is weird, though.

~~~
xkgt
Am I the only one who sees a different issue here? The problem is neither
stateful service over anycast nor TTL based hashing.

Being a DDoS mitigation service, one can imagine the need for stateful PoPs,
since each edge needs to track TCP state in order to provide DoS protection.
At the same time, it is understandable that the state can't be replicated at
scale.

As for including TTL in the hashing algorithm, it is aimed at solving link
under-utilization, so it is also a valid implementation.

The real bug here is the Arista CPE mangling the TTL field of the Client Hello
packets. I always hate it when networking gear meddles with the protocol
stack. Sure, it gives some flexibility, but time and time again it ends up
breaking something somewhere in the path, since much of Internet networking is
a pile of assumptions. Tampering with protocol fields unilaterally is going to
break someone's assumptions somewhere down the path.

~~~
frnkblk
To clarify, the Arista routers are the ISP's border routers. It was the
residential/business customers' SOHO routers, of various makes and models,
that were not decrementing the TTL of the initial TCP SYN.

~~~
xkgt
Thanks for pointing that out. Sorry, I overlooked that part. Perhaps I got
primed by the opening statements, which implied that the problem happened only
after the new Arista routers were placed, and hastily assumed that it was
Arista's routers that mangled the bits.

~~~
frnkblk
You are correct, the problem started only after placing the new Arista border
routers. The previous border routers were not doing ECMP. The issue was a
combination of the use of anycast, diverse Internet transit egress, this model
of Arista defaulting to using the packet's TTL in its ECMP hash calculation,
and the end-customer CPE routers egressing packets that are part of the same
TCP connection with variable TTL values. Change any one of those items and the
issue would not have shown up.
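That interaction can be sketched with a toy hash (addresses, link names, and the hash function are illustrative, not Arista's actual algorithm): fold the TTL into the ECMP key, and two packets of the same connection arriving with TTL 64 vs 63 can be steered out different transit egresses, and therefore toward different anycast PoPs:

```python
# Sketch of the failure mode: a border router that includes TTL in
# its ECMP hash can send two packets of one TCP connection (a SYN the
# CPE forgot to decrement, and a later packet with normal TTL) out
# different transit links, landing at different anycast PoPs -> RST.
import hashlib

EGRESSES = ["transit-A", "transit-B"]  # hypothetical upstream links

def ecmp_pick(src, dst, sport, dport, ttl=None):
    # 4-tuple ECMP key; optionally (unwisely) folds in the TTL field
    fields = (src, dst, sport, dport) + (() if ttl is None else (ttl,))
    h = hashlib.md5(repr(fields).encode()).digest()
    return EGRESSES[int.from_bytes(h[:4], "big") % len(EGRESSES)]

conn = ("198.51.100.9", "151.101.1.57", 51823, 443)

# CPE forwards the first SYN without decrementing TTL; later packets
# of the same connection arrive one lower.
with_ttl = (ecmp_pick(*conn, ttl=64), ecmp_pick(*conn, ttl=63))
without_ttl = (ecmp_pick(*conn), ecmp_pick(*conn))

print(without_ttl[0] == without_ttl[1])  # True: stable path without TTL
print(with_ttl)  # may differ -> different PoP never saw the handshake -> RST
```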

