
The Ethernet PAUSE frame - quandary
http://jeffq.com/blog/the-ethernet-pause-frame/
======
francoisLabonte
Ah... The problems of crappy consumer Ethernet equipment (I work at an
Ethernet switch vendor, so excuse the rant).

What is likely happening is that your switch is configured by default to
implement both rx and tx pause. Your TV, which is also (in my opinion)
erroneously configured to transmit pause, goes bonkers and starts sending
pause frames to your switch. Your switch then starts buffering packets for
your TV until the buffers are full, and then starts transmitting pause to
everyone else on its other ports. The switch must have some horrible
buffering policy where one port (the TV port) can hog all the buffers and
deprive every other port of the ability to send...

Now the kicker is how every end station implements pause. Notice the pause
quanta in the PAUSE packet are in units of 512 bit-times, and in the packet
you captured it is set to the maximal value of 65535. On a 100 Mbps port
(presuming 100 Mbps, since the MediaTek has 4x 100 Mbps (Fast Ethernet) and
2x 1 Gbps (Gigabit Ethernet) ports) that computes to:

    512 * 65535 / 100e6 = 0.3355392 seconds
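
To double-check that arithmetic at a few link speeds, here's a throwaway
Python sketch (pause_seconds is just an illustrative helper, not anything
from a real driver):

    QUANTUM_BITS = 512  # one 802.3x PAUSE quantum = 512 bit-times

    def pause_seconds(quanta: int, link_bps: float) -> float:
        """How long a PAUSE with this quanta asks the link partner to stop."""
        return quanta * QUANTUM_BITS / link_bps

    for mbps in (100, 1000, 10000):
        ms = pause_seconds(0xFFFF, mbps * 1e6) * 1000
        print(f"{mbps:>5} Mbps: max pause = {ms:.3f} ms")
    # 100 Mbps -> 335.539 ms, 1 Gbps -> 33.554 ms, 10 Gbps -> 3.355 ms

So a single max-quanta PAUSE stalls a Fast Ethernet link for a third of a
second, and if the sender keeps refreshing it, the link stays dead.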

A normal pause implementation sends this packet periodically, and once it
has buffers available to receive again it will send a pause with a quanta of
0, meaning cancel the previous timer... but if it's malfunctioning, who
knows if it ever will...

The sad part is that I don't even know what to recommend for a good
consumer-level switch that has good defaults, or configurable defaults and a
sane buffer config... Mine is a dinky one, probably vulnerable to this
problem as well... Need to do some research.

~~~
vardump
> A normal pause implementation sends this packet periodically, and once it
> has buffers available to receive again it will send a pause with a quanta
> of 0, meaning cancel the previous timer... but if it's malfunctioning, who
> knows if it ever will...

As with most protocols, it doesn't work if it's not implemented properly.

Many hardware implementations have no knowledge of when the buffer can be
emptied, so it's understandable that they treat it as an on/off switch,
screaming: my buffer is full, don't send me anything for 65535*512
bit-times! Which is perfectly fine, because otherwise all those incoming
frames would need to be dropped anyway.
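
For reference, a PAUSE frame is tiny and easy to pick apart. A minimal
Python sketch (parse_pause is a hypothetical helper; the constants are the
802.3x ones):

    import struct

    PAUSE_DST = bytes.fromhex("0180c2000001")  # reserved MAC-control multicast
    ETHERTYPE_MAC_CONTROL = 0x8808
    OPCODE_PAUSE = 0x0001

    def parse_pause(frame: bytes):
        """Return the pause quanta if this is an 802.3x PAUSE frame, else None."""
        if len(frame) < 18 or frame[0:6] != PAUSE_DST:
            return None
        ethertype, opcode, quanta = struct.unpack("!HHH", frame[12:18])
        if ethertype == ETHERTYPE_MAC_CONTROL and opcode == OPCODE_PAUSE:
            return quanta  # 0xFFFF = stop as long as possible, 0 = resume
        return None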

Remember, small embedded devices often can't guarantee an Ethernet DMA slot
to DRAM in time, and definitely can't afford a dedicated DRAM channel, so
those buffers live in a 2-8 kB on-chip SRAM block or equivalent.

When the buffer is "full" (above the high-water mark), an interrupt gets
generated and the device firmware sets up an appropriate DMA transfer to
empty it. Once that is done, the device should of course send a PAUSE 0, and
all is good.
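
In frame terms that watermark dance is just two frames. A rough sketch of
what such firmware emits (build_pause and the source MAC are made up for
illustration, not any particular vendor's API):

    import struct

    PAUSE_DST = bytes.fromhex("0180c2000001")  # reserved MAC-control multicast

    def build_pause(src_mac: bytes, quanta: int) -> bytes:
        """802.3x PAUSE frame, padded to the 60-byte minimum (MAC adds the FCS)."""
        hdr = PAUSE_DST + src_mac + struct.pack("!HHH", 0x8808, 0x0001, quanta)
        return hdr.ljust(60, b"\x00")

    src = bytes.fromhex("020000000001")  # made-up locally administered MAC
    xoff = build_pause(src, 0xFFFF)      # high-water mark hit: stop sending
    xon = build_pause(src, 0x0000)       # DMA drained the FIFO: resume

A broken device is one that keeps sending the first frame and never gets
around to the second.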

> The sad part is that I don't even know what to recommend for a good
> consumer-level switch that has good defaults, or configurable defaults and
> a sane buffer config...

As you must know, you can turn it off entirely in most managed switches and
see what happens to the data transfer speed.

Most consumer-level gigabit switches seem to have maybe 16 kB of buffer. So
they don't really have much buffering (or anything) to configure.

~~~
francoisLabonte
Like you said, the host might have small buffers, and without pause it would
drop packets. But then who's supposed to buffer those packets? The cheap
switch with 16 kB of buffers and a buffer configuration so idiotic that
everyone else on that switch gets paused?

You seem to think that it's bad to drop packets in the NIC. Some NICs might
have buffers that are too small, but in general you should drop. If you use
TCP, the window will adjust to whatever your bad NIC and embedded system can
handle. At least you won't affect the others by spreading pause like a
cancer (can you tell I'm cynical about pause?).

On a switch you can usually drop packets based on the number of packets
destined to a port and the number of packets buffered per input port. This
is how you avoid head-of-line blocking. But if you are right about 16 kB,
that's barely enough for one jumbo packet (~9200 B)... geez, that's
depressing.

~~~
vardump
> If you use TCP, the window will adjust to whatever your bad NIC and
> embedded system can handle.

TCP windows, sigh... They can't deal with a situation where, say, every
second frame is lost because someone thought 2 kB was enough buffer. TCP
congestion control mechanisms are great for actual congestion, but when
packet loss is due to other causes they behave pretty badly.
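
To put a rough number on it, the classic Mathis back-of-envelope model
bounds TCP goodput at about MSS/RTT * 1.22/sqrt(loss). The model isn't
really meant for loss rates this extreme, but a quick Python sketch still
makes the point:

    from math import sqrt

    def mathis_goodput_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
        """Rough upper bound on TCP goodput under random loss (Mathis et al.)."""
        return (mss_bytes * 8 / rtt_s) * 1.22 / sqrt(loss)

    # 1460-byte MSS on a 1 ms LAN RTT:
    print(f"{mathis_goodput_bps(1460, 1e-3, 1e-4) / 1e6:.1f} Mbit/s")  # mild loss: ~1425
    print(f"{mathis_goodput_bps(1460, 1e-3, 0.5) / 1e6:.1f} Mbit/s")   # every 2nd frame lost: ~20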

Again, TCP is no substitute for flow control in this case.

It doesn't matter how nice a NIC you have. The problem usually happens
before the packets reach your nice NIC.

~~~
fanf2
Lack of buffer space is pretty much the definition of congestion.

------
ChuckMcM
Also one of the only ways to negotiate your way out of a spanning-tree
broadcast storm. Generally the firmware on the MAC will reflect a pause
frame to the source when its FIFO is full. That happens because the host is
not pulling packets out of the FIFO fast enough, or because the network has
gone bonkers and is sending a gazillion packets per second.

The latter can happen when your misconfigured DHCP server gives out an
address that other nodes on your network believe to be the broadcast address
for the subnet. The device with that ill-fated address gets deluged after
every packet it sends, as everyone acks or naks or responds with queries. I
saw that happen when a NetGear router had a netmask of 255.255.255.248,
which the user had copied from the WAN config to the LAN config, while the
DHCP server was told the netmask was 255.255.255.0. Hilarity (not) ensued.

~~~
dekhn
This also happens in completely normal operation, for example if you're
using a TCP-based MPI implementation and do an all-versus-all message send.
The destination buffers fill quickly from all the senders, the receiver
drops the packets, and TCP sees that as a timeout after 250 ms and requests
a retransmit. In principle, using PAUSE frames allows the sender to get
feedback to pace its sends.

Took me a long time to debug my MPI performance problems because of this.

~~~
greglindahl
Uh, no. Alltoall is a challenge for MPI, but not for the reason you describe.
TCP windows mean that the receivers aren't the problem. It's all the switch
queues in the middle.

~~~
vardump
TCP windows won't save you. TCP has no way to magically know when some
buffer is full. Instead it notices packet loss and interprets it as
congestion, which is not what you want, because it can significantly reduce
throughput.
~~~
r4um
Yes it does: the receiver advertises a window of 0 and the persist timer
kicks in. The send call then blocks until the window recovers.

~~~
vardump
That's the _problem_: you don't really even have any congestion, just very
high packet loss caused by small buffers. The transfer rate drops to
nothing.

With pause frames you can avoid that situation.

------
rdtsc
Nobody knows about PAUSE frames until they bite you.

I found out about them when someone at a place I worked wanted to design a
custom Ethernet driver for an embedded device. There was no good reason for
it; they could have run the regular one shipped for that device (it was an
RPi-equivalent kind of unit).

So there they went, and months later it emerged. Everyone was amazed: oh
wow, a handcrafted Ethernet driver, impressive.

Except what ensued was months of debugging and Wireshark captures. Not
handling PAUSE frames and flooding the network with packets took a good
chunk of that time. Of course it was blamed on stupid switches and a broken
protocol, and not on the bad decision to rewrite a known, well-defined and
stable protocol without a good reason to do so.

~~~
marcoperaza
I'm sure it wasn't a very good one, but do you recall what the actual stated
reason for this undertaking was? It surely couldn't have been "just because".

~~~
rdtsc
Low-latency processing and speed. But it was done without anyone measuring
the latency and performance of the existing one; that's the crazy part.

------
martyvis
The PAUSE frame is meant to be sent by a station (host) to the switch (or
vice versa) as a flow control mechanism, only for that port. Assuming the
switch has at least some egress buffering, it shouldn't result in
propagation away from that switch port (say, to the switch's uplink port)
unless the switch finds itself completely congested. Most hosts won't have
flow control configured at layer 2, instead relying on TCP congestion
control. It is only useful when you have non-TCP traffic, for instance Fibre
Channel over Ethernet, where you want to avoid packet loss and prefer to
force buffering upstream.

------
drewg123
The only "safe" way to configure traditional (not per-priority) pause frames
is to configure the switch to ignore pause frames coming from hosts, and to
configure hosts to obey pause frames coming from switches. With data center
bridging and per-priority pause, some of that goes out the window.

In the early days of 10GbE, I did drivers for a NIC that had a very small rx
FIFO. In some cases we had to advise customers to enable pause frames;
otherwise the NIC would be subject to tail drops when the switch burst
traffic to us. I still feel kind of bad about giving out that advice.

------
0x0
That's some great detective work.

But why is the Android TV spamming these pause frames at all?

~~~
drewg123
Probably because the Ethernet driver or hardware has a bug, the rx buffer is
full, and it has been configured to enable pause.

I once took down an entire corp network by doing serial kernel debugging on
a machine with pause frames enabled. Once the debugger took control of the
kernel, the driver's rx interrupt handler stopped running and the rx buffers
filled. Eventually the rx buffers were totally consumed, and the NIC started
to send pause frames rather than dropping the packets. To make matters
worse, I was remote, so I had to call somebody to power-cycle the box.

~~~
vxNsr
For extra points, use an IP phone system that's on the same network as your
computers.

------
vardump
L2 pause frames are a necessity for small networks that don't have
server-grade hardware, where your MAC receive buffers are measured in a few
kilobytes. That includes most embedded devices (consumer devices, printers,
etc., even industrial ones) and consumer/small-business switches; they have
tiny buffers.

Your datacenter/server/workstation hardware is different. It deals with higher
speeds and has appropriate buffering and control.

I wish people here would understand that the consumer/embedded space needs
pause frames to function properly, and that TCP congestion control will
often significantly hurt performance otherwise. TCP can't magically know
when some buffer is full. Without pause frames, low-level hardware will send
at full throttle. It's _not_ ok if every second frame is lost.

------
RKearney
I wouldn't call this obscure by any means. You'll find Ethernet flow control
enabled on just about every datacenter network, especially those that have a
combined network and storage fabric.

~~~
mrmagooey
Is there a concern that the mechanism allows for DoS? How do they mitigate the
situation that the author describes?

~~~
toast0
If you have standards-compliant hardware, pause is point-to-point, not
broadcast. You can configure hosts to ignore pause and also not to generate
it, although it may be difficult to configure an embedded device, so you
probably need to fix the switch or replace it with something that works.

------
bogomipz
"The very existence of Ethernet flow control may come as a shock, especially
since protocols like TCP have explicit flow control"

Not at all. Ethernet is ancient, and there are other transport protocols
besides TCP that can use it and have used it in the past: AppleTalk,
IPX/SPX, DECnet, to name a few. This is the beauty of the OSI model.
Ethernet at layer 2 is independent of what rides on it at layer 4.

~~~
lmm
Switches aren't supposed to exist in the OSI model at all, which makes it of
questionable usefulness today. When was the last time you used a purely
hubbed network?

~~~
bogomipz
I am not understanding the point you are trying to make.

OSI is just a model and it is not limited to end stations. What gives you that
impression?

Conceptually and practically, switches most certainly do exist in the OSI
model.

For all practical purposes a switch can be thought of as a multiport bridge,
and as such it exists at Layer 1 and Layer 2.

~~~
lmm
> OSI is just a model and it is not limited to end stations. What gives you
> that impression?

I didn't say anything about end stations? The point is that a switch is
inherently a violation of strict OSI layering (it uses layer-2 information
to make what are effectively layer-1 forwarding decisions), which, given
that practically all modern networks are switched, suggests that the 7-layer
model may not be that useful for modelling real-world networks.

------
balabaster
Does anyone know if this is controllable by software?

i.e. is it something DD-WRT, Tomato, et al. can alleviate?

I have been suffering similar symptoms on my home network and I also have a
Sony Android TV, though to be honest it hadn't occurred to me to bust out
Wireshark to figure out what was going on. In hindsight, I guess this was a
rookie mistake on my part.

------
dfc
I always thought you were not supposed to enable flow control on a network
with mixed 100-megabit and gigabit devices. From the list of things the OP
has hooked up to the network, I would be surprised if they are all operating
at 100 megabits.

------
signa11
What if folks start sending PAUSE frames when on WiFi? :)

~~~
rasz_pl
[http://sysnet.ucsd.edu/~bellardo/pubs/usenix-sec03-80211dos-...](http://sysnet.ucsd.edu/~bellardo/pubs/usenix-sec03-80211dos-html/aio.html)

[http://hackaday.com/2011/10/04/wifi-jamming-via-deauthentica...](http://hackaday.com/2011/10/04/wifi-jamming-via-deauthentication-packets/#comment-472484)

good paper on the subject summarizing different approaches:
[http://alumni.cs.ucr.edu/~kpele/COMST-preprint.pdf](http://alumni.cs.ucr.edu/~kpele/COMST-preprint.pdf)

AFAIR during 23C3(?) someone demonstrated NAV jamming; after turning it on,
everyone's WiFi connection in the whole hall dropped to zero speed while the
person demonstrating was happily browsing :)

