
New phenomenon breaks inbound TCP policing - RachelF
https://forums.whirlpool.net.au/forum-replies.cfm?t=2530363
======
colanderman
Title is wrong; updates come over TCP that has been modified not to perform
link sharing.

I have seen this with my wife's computer. Though I haven't seen the large-
window phenomenon (haven't looked), what I do see is dozens of active
connections opened to the same destination, which of course defeats both TCP's
link sharing and standard QoS algorithms.

I work around the issue by bucketing any Akamai IP ranges I find into a very-
low-priority queue, and let those TCP connections fight it out. Seems to have
worked well.

For those interested, here are the Akamai IP ranges I use:

    
    
        23.0.0.0/12
        23.32.0.0/11
        23.64.0.0/14
        23.72.0.0/13
        104.64.0.0/10
        2001:428:4403::/48
        2001:428:4404::/48
        2001:428:4405::/48
        2001:428:4406::/48
        2600:1400::/24
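
As a rough illustration of the bucketing step (not the actual router configuration described above), here is a minimal Python sketch that checks whether an address falls inside the ranges listed. The helper name `is_akamai` is made up for this example, and the ranges are only the ones quoted in this comment, so they may be incomplete or out of date.

```python
import ipaddress

# Ranges copied verbatim from the comment above; they may be incomplete or stale.
AKAMAI_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "23.0.0.0/12",
        "23.32.0.0/11",
        "23.64.0.0/14",
        "23.72.0.0/13",
        "104.64.0.0/10",
        "2001:428:4403::/48",
        "2001:428:4404::/48",
        "2001:428:4405::/48",
        "2001:428:4406::/48",
        "2600:1400::/24",
    )
]


def is_akamai(address: str) -> bool:
    """Return True if the address lies in one of the listed ranges.

    Hypothetical helper for illustration only.
    """
    ip = ipaddress.ip_address(address)
    # Only compare addresses and networks of the same IP version.
    return any(ip.version == net.version and ip in net for net in AKAMAI_RANGES)
```

A shaper could use a check like this to decide which flows go into the low-priority queue; on a real router the same list would more likely be loaded into an address list or ipset.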

~~~
rhubarbquid
It's also not Windows 10 specific; the post also mentions Office updates.

~~~
JonathonW
There's a mention somewhere in the middle of the thread that someone had seen
something similar with Apple's software updates, too: maybe it's a more
general Akamai issue?

Even if it is isolated to the servers distributing MS's updates, I'm having
trouble seeing how this could even be MS's fault-- it's the server side (owned
and operated by Akamai) that's misbehaving here.

~~~
colanderman
I'm not convinced either way. Windows Update is the one opening the dozens of
connections (it has to if it's behind a firewall), not Akamai; Akamai's just
serving the data. But Windows Update is possibly just calling out to an Akamai
library that opens the dozens of connections on its behalf.

If there are also congestion-algorithm shenanigans like the forum post
suggests, then Akamai is definitely at fault. But the issue I see is that of
connection hogging.

~~~
kev009
* [https://developer.akamai.com/stuff/Optimization/TCP_Optimiza...](https://developer.akamai.com/stuff/Optimization/TCP_Optimizations.html)

* [https://www.akamai.com/us/en/about/news/press/2013-press/aka...](https://www.akamai.com/us/en/about/news/press/2013-press/akamai-speeds-downloads-and-online-video-quality.jsp)

------
chadnickbok
Hey cool, someone forgetting _yet again_ why we use TCP.

We don't use TCP because it's fast. We don't use it because it's reliable
(although that's really useful). We use it because _we kept breaking the
internet_. Once you get above a certain threshold, the network can't keep up
with you and packets start getting dropped. The problem is that backing off
just a little doesn't allow the network to recover.

Instead, we need to use exponential backoff in the face of packet loss to
ensure that the network as a whole can recover.

But if you're pretty much the only connection misbehaving, and everything else
backs off, then you can kinda get away with not using exponential backoff. The
problem is that the applications for which it was "kinda okay" to do this were
VOIP and friends, where realtime delivery is really important and exponential
backoff causes noticeable drops in quality.

For a great read about these kinds of issues, check out the TCP-Friendly rate
control RFC:
[https://tools.ietf.org/html/rfc5348](https://tools.ietf.org/html/rfc5348)
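
To make the "only one connection misbehaving" point concrete, here is a toy Python model, a sketch under loud assumptions rather than a faithful TCP simulation. Two flows share one link, both add one unit of rate per round, and a loss is signalled when their sum exceeds the capacity. The polite flow halves its rate on loss (multiplicative decrease, one common form of the backoff described above); the other flow may ignore losses entirely. The function name, round count, and capacity are all invented for illustration.

```python
def share_link(rounds: int = 300, capacity: float = 100.0, greedy: bool = True):
    """Toy model of two flows sharing a link (all numbers are assumptions).

    Returns the final (polite_rate, other_rate) after the given rounds.
    """
    polite, other = 1.0, 1.0
    for _ in range(rounds):
        # Additive increase: both flows probe for more bandwidth.
        polite += 1.0
        other += 1.0
        if polite + other > capacity:  # congestion: packets get dropped
            polite /= 2.0  # the polite flow backs off
            if greedy:
                # The greedy flow ignores loss; only the link itself caps it.
                other = min(other, capacity)
            else:
                other /= 2.0  # a well-behaved peer backs off too
    return polite, other


if __name__ == "__main__":
    print("both back off: ", share_link(greedy=False))
    print("one ignores loss:", share_link(greedy=True))
```

In this model the two well-behaved flows end up with equal shares, while a flow that never backs off ends up with essentially the whole link and starves the polite one.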

~~~
wtallis
> Once you get above a certain threshold, the network can't keep up with you
> and packets start getting dropped. The problem is that backing off just a
> little doesn't allow the network to recover.

Another aspect of this problem is that the network is too hesitant to drop
packets [1], so by the time you've noticed packet loss things have gotten bad
enough that the drastic backoff is needed. Widespread deployment of ECN and
AQM would allow for more rapid feedback before any huge backlog develops, and
consequently a less extreme response to congestion signals could be used.

[1] Arista would rather their 10GbE switches add up to 100ms of queuing delay
per port than drop a packet: [https://lists.bufferbloat.net/pipermail/cerowrt-
devel/2016-J...](https://lists.bufferbloat.net/pipermail/cerowrt-
devel/2016-June/010701.html)
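
For a sense of scale, the arithmetic behind "100ms of queuing delay" on a 10 Gbit/s port can be sketched in Python as below. The function names and the two packet sizes are assumptions chosen for illustration; the only inputs taken from the discussion are the link speed and the delay.

```python
def backlog_bytes(link_bps: float, delay_s: float) -> float:
    """Bytes that must sit in a queue to add delay_s on a link draining at link_bps."""
    return link_bps * delay_s / 8.0


def backlog_packets(link_bps: float, delay_s: float, packet_bytes: float) -> float:
    """Same backlog expressed in packets of an assumed fixed size."""
    return backlog_bytes(link_bps, delay_s) / packet_bytes


if __name__ == "__main__":
    link = 10e9    # 10GbE port
    delay = 0.100  # 100 ms of queuing delay
    print(f"backlog: {backlog_bytes(link, delay) / 1e6:.0f} MB")
    # Packet sizes are illustrative: full-size and minimum-size Ethernet frames.
    for size in (1500, 64):
        print(f"{size}-byte packets: {backlog_packets(link, delay, size):,.0f}")
```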

~~~
jfindley
Slightly OT, but that link is _astonishing_.

That anyone can think adding 100ms of latency to a 10GbE switch, even under
heavy contention, is a good feature is absolutely staggering.

~~~
iofj
It's not quite 100ms. A bit less. The explanation is simple: if TCP
exponential backoff fires, you will have a very bad time on any TCP
connection. Site owners, obviously, don't want that.

Try this:

        iptables -A INPUT -m statistic --mode random --probability 0.001 -j DROP

And see how your internet works. TLDR: sometimes loading times go through the
roof, some instant messages go through in <0.1s, and on occasion it takes 30+
seconds, on occasion it's a DNS query that gets dropped and a page load
suddenly takes 1 minute for no identifiable reason, large downloads always
"get fucked" (suddenly lose 90% of their bandwidth and take several minutes to
recover). Bursty traffic doesn't work. If you start Firefox with 20+
tabs open, 80% of them will never load.

You will not enjoy the experience.

So yes, people think that adding 100ms of latency is better than dropping a
packet under contention.

~~~
wtallis
Your numbers are ridiculous. There's a huge gulf between buffering _millions_
of packets per switch port before a single drop, and a 1 in 1000 drop
probability. You're also assuming that the drops are indiscriminate when a
refusal to consider AQM and fair queuing is what led Arista to this absurdity
in the first place, and you're presuming that latencies would still be
astronomical in a world without massive queues.

A 10GbE network in a datacenter without bufferbloat would have RTTs orders of
magnitude smaller than the 100ms queuing delay Arista considers acceptable;
the effects of a congestion event would be ancient history by the time
Arista's queues could drain. Even outside the datacenter, 100ms is a pretty
long time for most connections in a managed-queue world. A congestion event on
a device using fq_codel won't kill your DNS request or TCP handshake; it'll
slow down an established flow and if you're using ECN you won't even lose a
packet. It's only in a DDoS-like scenario of thousands of unresponsive
connections (such as TCPs with a large initial window) beginning to transmit
simultaneously that you'd see some flows getting unfairly penalized, but
things would equalize within a few RTTs if the traffic was real TCP and not a
true DDoS. You only see it take _minutes_ for a download's throughput to
recover if you're going over multiple satellite links or through a severely
bloated queue.

~~~
iofj
Okay make it one in a million. You will still be able to tell, and still see
the phenomena I'm talking about.
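
To put rough numbers on the two drop rates argued over above, here is a small Python sketch. It uses the well-known Mathis approximation for loss-based TCP, throughput ≈ (MSS/RTT) · C/√p with C ≈ √1.5, which assumes random, independent loss and a Reno-style sender; real stacks and real networks will differ. The MSS, RTT, and packet counts are assumptions picked for illustration.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_prob: float,
                          c: float = math.sqrt(1.5)) -> float:
    """Approximate steady-state TCP throughput in bits/s: (MSS/RTT) * C / sqrt(p)."""
    return (mss_bytes * 8.0 / rtt_s) * c / math.sqrt(loss_prob)


def prob_any_drop(packets: int, loss_prob: float) -> float:
    """Probability that at least one of `packets` independent packets is dropped."""
    return 1.0 - (1.0 - loss_prob) ** packets


if __name__ == "__main__":
    mss, rtt = 1460, 0.100  # assumed segment size and round-trip time
    for p in (1e-3, 1e-6):
        mbps = mathis_throughput_bps(mss, rtt, p) / 1e6
        hit = prob_any_drop(1000, p)
        print(f"p={p:g}: ~{mbps:.1f} Mbit/s ceiling, "
              f"P(drop in 1000 packets)={hit:.4f}")
```

Under this model the throughput ceiling scales with 1/√p, so going from one drop in a thousand to one in a million raises it by a factor of about 32.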

------
chopsywa
I was the OP in the Whirlpool post. I use Mikrotik extensively to manage
connections and have done for many years. This has only become an issue
recently, and it has happened enough times now for me to ascertain that it
occurs when a Windows 10 or Office 2016 machine is using the new Windows
Update mechanism. I have tried to limit the issue by capping new TCP
connections per second to any given IP address and even by limiting maximum
concurrent connections. On occasion, while this issue is occurring, I have seen
a sudden huge burst of new outbound connections. I was thinking this would cause
a type of DoS attack with thousands of SYN-ACKs coming back.

The real kicker is that the connections are all to servers (Akamai) on port
80, so any serious blocking breaks all web browsing. The cynic in me says the
whole Windows 10 update thing has been made to operate in lockdown
environments where non-well-known ports are blocked. Intentional or not, the
Internet is basically broken while this happens as Windows is ubiquitous and
people all over the world who have successfully used inbound rate limiting to
create successful shared Internet connections are going to be getting angry
support calls. I hope my post goes viral so it starts to get seen by the likes
of Microsoft and Akamai engineers. The local ISP I spoke to where I initially
noticed this problem pretty much fobbed me off with the old "nobody else has
reported the issue."

~~~
riskable
Well, the good news is that this is a temporary problem. There's only so many
computers out there that will get the Windows 10 upgrade and presumably that
number will drop like a rock as PCs either get upgraded or the users switch to
Linux ;)

~~~
MertsA
Windows 10 updates going forward will use the same method.

------
jsnell
Where did the HN submission title get UDP from? I don't see anyone in that
thread suggesting that the updates were done in UDP, and all the traffic in
the trace file is TCP.

The trace is indeed a total mess, but I'm not convinced it's anything to do
with TCP acceleration. There's absolutely massive levels of reordering and
packet duplication happening in ways which are not consistent with TCP
acceleration at all. It's much more likely that it's some kind of
configuration problem elsewhere in the network.

From eyeballing the trace, almost half the payload segments there are
duplicates, while a much smaller proportion are retransmits. (You can tell the
difference e.g. using IP ids or by TCP timestamp TSvals / TSecrs).
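
A minimal Python sketch of that duplicate-versus-retransmit heuristic follows. It assumes the trace has already been reduced to records of TCP sequence number, IP ID, and TSval; the `Segment` record and `classify` function are hypothetical names, and no real pcap library is used here. The assumption is that a copy made in the network carries the same IP ID and TSval as the original, while a sender retransmission gets fresh ones. That can fail, for example on stacks that do not vary the IP ID.

```python
from collections import namedtuple

# Hypothetical record: one payload-carrying segment extracted from a trace.
Segment = namedtuple("Segment", "seq ip_id tsval")


def classify(segments):
    """Label each segment as 'new', 'duplicate', or 'retransmit'.

    'duplicate'  : same seq AND same (ip_id, tsval) as an earlier segment,
                   i.e. it looks like the same packet seen twice.
    'retransmit' : same seq but a different (ip_id, tsval),
                   i.e. the sender appears to have sent the data again.
    """
    seen = {}  # seq -> set of (ip_id, tsval) pairs already observed
    labels = []
    for seg in segments:
        key = (seg.ip_id, seg.tsval)
        if seg.seq not in seen:
            seen[seg.seq] = {key}
            labels.append("new")
        elif key in seen[seg.seq]:
            labels.append("duplicate")
        else:
            seen[seg.seq].add(key)
            labels.append("retransmit")
    return labels
```

For example, `classify([Segment(1000, 1, 50), Segment(1000, 1, 50), Segment(1000, 2, 60)])` labels the second record a duplicate and the third a retransmit.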

~~~
dang
We changed the title back to what the article says. The submitted title was
"Windows 10 updates via UDP bypassing QoS restrictions".

Submitters: the HN guidelines ask you please not to rewrite titles except when
they are misleading or linkbait. It seems like in this case the rewrite made
it more misleading.

If anyone suggests a better title, we can change it again.

~~~
RachelF
Apologies for that, I was trying to summarize the linked article in the title,
not to mislead.

------
voltagex_
PCAP is at [https://forums.whirlpool.net.au/forum-
replies.cfm?t=2530363&...](https://forums.whirlpool.net.au/forum-
replies.cfm?t=2530363&p=3#r52). It's not UDP.

I'll have to see whether this is what causes my 100 megabit downlink to behave
as if it's capped at 30 megabits sometimes. The router can barely keep up as
it is.

~~~
riskable
I've had a problem similar to this with my 150 megabit downlink (sigh, wish I
had that much upload). The thing to remember when troubleshooting these sorts
of problems is this:

    
    
        Not all iptables rules are created equal.
    

State tracking has _significantly_ more overhead than other types of
rules/filtering/shaping (even though state tracking is required for certain
types of shaping). You may or may not have already done this but if not try
this:

    
    
        iptables -t raw -A PREROUTING -i eth0 -p tcp -m tcp --dport 443 -j NOTRACK
        iptables -t raw -A PREROUTING -i eth0 -p tcp -m tcp --dport 80 -j NOTRACK
    

(replace eth0 with your Internet interface)

State tracking on port 443 and 80 is mostly useless and it's where you're
likely to see the majority of your (high bandwidth) traffic. Setting NOTRACK
on those ports can make a HUGE difference while still enabling your squirrel-
powered router (let me guess: It requires no active cooling? haha) to shape
traffic like "teh big boys."

~~~
voltagex_
For years and years I was on ~8 megabit ADSL. I thought 100 megabit fibre
would solve all my problems but it just moves my problems into a different
class.

Thanks for that info. As soon as I can work out why the default congestion
control / single connection speed on FreeBSD 10 is so bad, I'll be running
that as a router.

------
NeutronBoy
I've actually seen the same thing recently - updates will soak up all
available bandwidth, to the point where web browsing is basically impossible.

~~~
xufi
For clarification, is it soaking up bandwidth because it's doing the (I forget
the term) thing where it uses your connection for other Windows 10 users while
they update?

~~~
colanderman
"Peer-to-peer transfer" is the term. You mean this hellspawn?
[https://www.akamai.com/us/en/solutions/products/media-
delive...](https://www.akamai.com/us/en/solutions/products/media-
delivery/netsession-interface-faq.jsp)

I have seen my wife's Windows 10 computer upload crap from time to time and I
suspected but could not confirm that it is this. How is this even possible,
given that we're behind a NAT? TCP simultaneous open, I suppose?

I've not yet been able to figure out how to block these shenanigans. I
_really_ don't appreciate MS/Akamai profiting off of my (rather limited)
upload bandwidth.

~~~
xufi
Ah yes, that's the term. I believe (not for downloading) that another MS app,
Skype, uses the same thing when you're talking to someone on a lower
bandwidth/ADSL connection. I believe it uses the same TCP mechanism of some
sort, which I haven't looked into.

~~~
jodrellblank
SuperNodes, which Skype used up until Microsoft bought them out (in May 2011)
and stopped doing that (in April 2012)?

[http://arstechnica.com/business/2012/05/skype-
replaces-p2p-s...](http://arstechnica.com/business/2012/05/skype-
replaces-p2p-supernodes-with-linux-boxes-hosted-by-microsoft/)

~~~
xufi
Oh I see. I figured they still used them. Skype's UI sure has suffered, sadly.

~~~
riskable
It's not just the UI that suffered. The latency and performance of Skype calls
dropped significantly after they switched from P2P to Microsoft Notification
Protocol 24 (aka MSN Messenger Protocol)...

[https://en.wikipedia.org/wiki/Microsoft_Notification_Protoco...](https://en.wikipedia.org/wiki/Microsoft_Notification_Protocol)

They also introduced serious security problems when they made that change...
So instead of Skype messages going from one client directly to another they go
through Microsoft's servers (where they are stored and intercepted by TLAs)
_unencrypted_.

They also introduced a new "feature" whereby their systems read everything you
write:

[http://www.h-online.com/security/news/item/Skype-with-
care-M...](http://www.h-online.com/security/news/item/Skype-with-care-
Microsoft-is-reading-everything-you-write-1862870.html)

------
0XAFFE
It's interesting how on the forum this is totally not about Windows 10, but
more about Akamai and here all people are bashing on Windows.

~~~
chopsywa
Sadly my findings are that it is the new Microsoft update protocol and Akamai
in combination.

------
thomas-b
I've seen on multiple Win10 laptops that it just takes all the download
bandwidth (no noticeable upload), to the point where websites won't open and
Skype loses its connection. That's on a 5Mb line. I do see many connections
opened by a single process to the Microsoft IPs mentioned.

I believe it was really only update related but I saw it happen when actually
no updates were available. I just ended up limiting the corresponding process
bandwidth whenever it gets annoying.

All in all, I'm mainly just very surprised this kind of thing can
happen. Other than that, I'm fairly happy with Win10, as opposed to the usual
MS bashing we hear.

------
mcguire
A key part:

" _It was the same range of source addresses and this was with Windows server
and then Office updates. What seems to be happening is that instead of the
sending server reducing its window size when packets are dropped, it just
keeps re-sending large windows, which are obviously being dropped at my end.
The queue algorithm has no idea of this and it will be letting packets through
at a rate it thinks is correct, so the flow continues even though much of the
traffic is dropped. However as the traffic keeps coming, the link is totally
saturated._ "

Translation: someone broke TCP flow control.

------
wmf
Windows has a feature to perform _low_ priority downloads of updates called
Background Intelligent Transfer Service: [https://technet.microsoft.com/en-
us/library/cc776905(v=ws.10...](https://technet.microsoft.com/en-
us/library/cc776905\(v=ws.10\).aspx)

There's also the Windows Update Delivery Optimization P2P feature:
[http://windows.microsoft.com/en-us/windows-10/windows-
update...](http://windows.microsoft.com/en-us/windows-10/windows-update-
delivery-optimization-faq)

~~~
colanderman
The problem with BITS is that it assumes that the computer it's on is the only
computer on the network. (In a switched network, how could it know otherwise?)
So it will start bulk downloads while other users are trying to use the
network interactively, with predictable consequences.

I basically just configure my network as if everything connected to it – both
clients and the ISP – are bad actors trying to DoS me (albeit, a DoS regulated
by TCP Vegas). Between my ISP's bufferbloat and these shenanigans from
MS/Akamai, it's not a bad approximation.

~~~
wmf
IIRC BITS used Vegas-like (or these days we'd say uTP-like) heuristics to
detect cross traffic and back off, but who knows what the latest version is
doing.
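
For readers unfamiliar with what "Vegas-like" means, here is a minimal Python sketch of the classic TCP Vegas window rule. This is not BITS's actual algorithm, which is not described in this thread; it only shows the general idea of backing off when delay rises rather than when packets are lost. The function name and the alpha/beta defaults are assumptions for illustration.

```python
def vegas_step(cwnd: float, base_rtt: float, rtt: float,
               alpha: float = 2.0, beta: float = 4.0) -> float:
    """One delay-based window update in the style of TCP Vegas.

    base_rtt is the smallest RTT seen (no queuing); rtt is the current RTT.
    """
    expected = cwnd / base_rtt  # rate if nothing were queued
    actual = cwnd / rtt         # rate actually being achieved
    # Estimated number of our own packets sitting in queues along the path.
    queued = (expected - actual) * base_rtt
    if queued < alpha:
        return cwnd + 1.0            # path looks idle: speed up
    if queued > beta:
        return max(cwnd - 1.0, 1.0)  # queues are building: back off
    return cwnd                      # in the target band: hold steady
```

Because cross traffic inflates the measured RTT, a sender following a rule like this yields to competing flows before any packet is dropped, which is the behaviour a background downloader wants.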

------
noonespecial
This isn't a new problem for us VOIP'ers. Bad actors have been breaking TCP
for a while. We solved it by putting a great big server/router up at a datacenter
with a fat link. Any connection that misbehaves by ignoring TCP drops or
flooding with UDP for more than a few seconds gets rerouted through that
server. (New connections are made through this route.) It has _outbound_ rate
limiting down the pipe to the local networks.

Keeps the local networks happy and fast. Isn't as expensive as routing
everything through a datacenter because only misbehaving IPs get rerouted.
Had the added side benefit that I could "protect" offices from the
"involuntary" win10 upgrade.

------
Sami_Lehtinen
New phenomenon? I think the first VOIP apps I ever used, on Windows 3.11,
already had some protocol-level tricks to work around TCP limitations,
like using ICMP and/or UDP traffic. Hardly news. I've been thinking this since
reading about TCP for the very first time. I've even published a concept of
"Bandwidth Hog", which is a transfer protocol designed to optimize your
transport by stealing others' bandwidth, aka not sharing it fairly.
[http://www.sami-lehtinen.net/blog/bandwidth-hog-quick-
concep...](http://www.sami-lehtinen.net/blog/bandwidth-hog-quick-concept-
draft)

~~~
colanderman
It's "new" in that it's been deployed by the most widespread desktop OS, so
networks _without_ compromised machines or bad actors have to deal with it
now.

------
KaiserPro
Ah, this looks like using multiple TCP streams to counter latency-limited
throughput. It's not really new; it's used in VFX to shuttle large amounts of
data about.
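
The reasoning behind that trick can be sketched with a line of arithmetic: a single TCP stream cannot move more than one window of data per round trip, so on a long path its throughput is capped at window/RTT, and opening N streams multiplies that cap until the link itself becomes the limit. The Python below is a rough sketch; the function name, window size, RTT, and link speed are assumptions for illustration.

```python
def window_limited_bps(window_bytes: float, rtt_s: float,
                       streams: int = 1, link_bps: float = float("inf")) -> float:
    """Upper bound on aggregate throughput for `streams` window-limited TCP flows."""
    per_stream = window_bytes * 8.0 / rtt_s  # one window per round trip
    return min(streams * per_stream, link_bps)


if __name__ == "__main__":
    window, rtt, link = 65535, 0.200, 1e9  # 64 KiB window, 200 ms RTT, 1 Gbit/s
    for n in (1, 16, 64):
        mbps = window_limited_bps(window, rtt, n, link) / 1e6
        print(f"{n:3d} stream(s): up to {mbps:.1f} Mbit/s")
```

The same multiplication is what hurts a shared link: each extra stream also claims its own share under per-flow fairness.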

------
uudecode
Is it not OK to criticize Microsoft on HN? This title was changed to remove
any mention of Windows. Just curious.

~~~
e40
I think it's that they (HN mods) like to use the original title.

~~~
MertsA
Especially when the submitted title contains relevant context for the article.
Can't be having that.

~~~
mikeash
The submitted title was just plain wrong.

~~~
MertsA
After looking at what the original title was, I agree. In this case it needed
to be changed but just dropping the part about UDP would have been fine. In
most cases though it honestly seems like when HN titles are changed it's
changed for the worse. It looks like I'm not the only one who feels that way
either.

[https://news.ycombinator.com/item?id=4102013](https://news.ycombinator.com/item?id=4102013)

~~~
mikeash
Go ahead and fight it in those cases, but this is not at all a good example.

------
0x0
Really, pushing Windows 10 has now become so urgent we can't let TCP slow us
down?!

