
Linux may have been causing USB disconnects - chalst
https://plus.google.com/u/0/116960357493251979546/posts/RZpndv4BCCD
======
bryanlarsen
Note that this bug was found because the software engineer talked to a
hardware engineer.

Props to Intel for hiring leading Linux developers and turning them loose.

~~~
raverbashing
Really

As an EE turned "software engineer" this bothers me, a lot.

I like the EE part of it, but I prefer things that change more easily and are
more "playful" (not to mention today hardware is at the mercy of software, so
you take the reference design and go with it)

But I've come into situations where I uncovered a HW bug (in the chip
reference board implementation, no less) that only manifested itself because
of something specific in software (in the HDMI standard - or better, things
from the standard inherited from things like VESA)

The Software Engineer sees ports/memory to be written to and doesn't know what
happens behind that

The Hardware engineer sees the "chip" and its connections but doesn't realise
the rabbit hole goes deeper "ah this is a simple USB device, only 8 pins" now
try communicating with it

~~~
dubcanada
That is a VERY general statement. Most software engineers who do hardware
stuff also know what it takes to make it. You can't design a driver for a card
without knowing everything about the card. And I'd say the same of hardware
engineers. If you don't know how the software is going to run, how are you
supposed to architect it? You can't make a piece of hardware without thinking
about how the driver will work.

~~~
stinos
though I agree that it's probably too general of a statement, I'm more with
raverbashing on this one: while now mainly doing software I have a strong
hardware background and it's more often than not just baffling to see the
approach of software-only engineers when having to code over the
software/hardware border. Then again, maybe I only met some exceptions. I also
have no idea if/how these topics are covered in a typical software
engineer's education.

~~~
tytso
Well, back when I was an MIT undergraduate, one of the core CS classes handed
us a breadboard and a bunch of 74XX TTL chips, and we needed to construct a
general register and stack computer from that. (We did get some PC boards that
gave us an ALU and a UART that plugged into the breadboard, as well as the
ability to program an EPROM to implement the firmware, but none of this
"here's an Arduino or Raspberry Pi".)

Maybe there's so much complexity in the software stack that we can't start CS
majors starting from the hardware level any more, but I can't help thinking
we've lost something as a result. These days, there are Java programmers who
get that "deer in headlights" look when confronted with terms such as "cache
line miss".

~~~
uhno
Which class was this, and are any of the notes or project assignments
available online? That sounds like a fun project.

~~~
tytso
It's 6.004, and it looks like they are still teaching it, which I was pleased
to discover. Students are no longer asked to carry around the suitcase-sized
breadboard "nerd kit" and they aren't using 74XX TTL chips any more. Now it's
all done using simulators.

The name of the course is "Computation Structures", or, as MIT students would
know it (since nearly everything at MIT is numbered, including buildings and
departments), 6.004:

[http://6004.mit.edu/](http://6004.mit.edu/)

------
makomk
Wait, if I'm reading this correctly there's no safe resume recovery time which
can be guaranteed not to cause devices to drop off the bus. The kernel could
wait 10 minutes and devices could still require more than that. That seems
like a pretty major issue with the USB specification.

~~~
nknighthb
If you issue a database query, you have no particular guarantee that it's
going to complete in any finite amount of time. At some point, you simply
throw up your hands and say it would be unreasonable to wait any longer, and
accept the resulting error condition.

~~~
acchow
No, the spec is saying don't issue a request at all for "at least" 10ms. "At
least" because you might have to wait longer, who knows how long, until it's
safe to issue a request without triggering a disconnect.

~~~
Zikes
I don't understand that, if devices can take as long as they please then why
give the 10ms delay at all? It'd be like telling the hardware engineers
"you're not allowed to wake up your device for at least 10ms". Is that just
the spec trying to say "this should cover most use cases" and it turns out
that's not really the case in practice? Or maybe it's an expected limitation
of the technology that it would be nearly impossible to ready a device within
10ms, so the spec is saying it would be pointless to issue a request before
that time?

~~~
davidp
I imagine it's something along the lines of "ok device, you have 10ms to play
with the link before anyone's watching, do what you need to do." For example,
electrically detecting the wattage of the power supplied over the link by
examining its responses to various inputs. (I have no idea whether that's at
all relevant here or if it's even the right part of the stack, I'm just
offering an idea of why the device might like the controller to say "I'm not
watching right now".)

------
kalleboo
Be sure to check out the mailing list post linked from the G+ post which
contains more technical details and proposed fixes:
[http://marc.info/?l=linux-usb&m=137714769606183&w=2](http://marc.info/?l=linux-usb&m=137714769606183&w=2)

~~~
milliams
Why on earth is all the text on that website set to font-weight:600 and using
Courier New of all fonts? Incredibly hard to read.

~~~
1337hax0ll
I can read the website perfectly fine. It may not be shiny like all the iCrap,
but it doesn't need to attract "average" users (e.g. spoiled, rich 12-year
olds).

I would rather prefer kernel hackers to do something useful, instead of
wasting time and money to make every LKML archive look aesthetically
beautiful.

hackers != designers

~~~
karlshea
So angry. Did you design Courier New or something?

> 1337hax0ll

> hackers != designers

I guess not, so there's no reason.

~~~
foobarbazqux
I think his point explains git beautifully.

~~~
karlshea
True story.

(I bet all of them had their terminal font set to Courier New Bold while doing
so)

~~~
marshray
I don't know if it's safe here to admit it, but I always kinda liked the xterm
fonts.

Is there something wrong with me?

~~~
foobarbazqux
No, fixed space fonts can be beautiful.

------
alexchamberlain
We should applaud them for standing up and saying "Hey, we cocked up, sorry!"

~~~
scrrr
Nothing wrong with that statement, except I think it should be a given.. Do we
need a pat on the back for doing the right thing? :)

~~~
sghill
Yes. As an industry, we're not very good at recognizing the great things
people do. We tend to focus solely on mistakes. It's not easy to admit
failure, and if we want it to continue we should recognize the effort it
takes.

------
bluesign
This is the wrong interpretation, actually.

There is no "maximum" for a reason. Because it should be evaluated as "hey
hardware developer, you will have guaranteed 10 ms from System Software to
resume". If you don't wake up in 10ms, you are clearly violating the spec.

9.2.6.2 states: After a port is reset or resumed, the USB System Software is
expected to provide a “recovery” interval of 10 ms before the device attached
to the port is expected to respond to data transfers. The device may ignore
any data transfers during the recovery interval. After the end of the recovery
interval (measured from the end of the reset or the end of the EOP at the end
of the resume signaling), the device must accept data transfers at any time.

~~~
delinka
Nothing there says that the hardware _must_ be ready at or after 10ms. It
simply says that software can't ask for anything before 10ms is up. Software
has to wait 10ms, and then might have to wait longer.

~~~
caf
That interpretation doesn't seem to make much sense though, because there
would be no point in specifying the 10ms at all - it means nothing, and
doesn't put any restrictions on anyone.

~~~
delinka
It does indeed put restrictions on the software author not to expect hardware
to be ready in less than 10ms. Perhaps the specification is somewhat
nonsensical. That does not permit the software author to create restrictive
interpretations. IMHO, it informs the software author that flexibility should
be allowed in their code.

~~~
caf
But under your interpretation, the software can't expect the hardware to be
ready in less than 100ms, 10s or 10 minutes either - so the value "10ms" is
meaningless.

~~~
delinka
Indeed. And the software should continue to wait for readiness or decide when
to give up. Giving up at the 10ms mark isn't against spec, it's just not
prudent.
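
To make the "wait at least 10ms, then keep waiting or give up" reading concrete, here is a minimal sketch of the policy, not the kernel's actual code. The `is_ready` callback, the 1 s ceiling, and the 5 ms poll interval are all assumptions made for illustration; only the 10 ms minimum comes from the spec text quoted above.

```python
import time
from typing import Callable

MIN_RECOVERY_S = 0.010  # the 10 ms recovery interval from section 9.2.6.2
MAX_WAIT_S = 1.0        # hypothetical give-up point; the spec names no maximum


def wait_for_resume(
    is_ready: Callable[[], bool],
    min_recovery: float = MIN_RECOVERY_S,
    max_wait: float = MAX_WAIT_S,
    poll_interval: float = 0.005,  # assumed value, purely illustrative
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the device reports ready, False if max_wait elapses.

    `is_ready` is a hypothetical probe standing in for "try a transfer and
    see whether the device answers".
    """
    start = clock()
    # Never touch the device before the minimum recovery interval is over.
    sleep(min_recovery)
    while True:
        if is_ready():
            return True
        if clock() - start >= max_wait:
            # Treat the device as gone only after the longer deadline.
            return False
        sleep(poll_interval)
```

The `sleep` and `clock` parameters exist only so the policy can be exercised with a fake clock; giving up at the 10 ms mark corresponds to setting `max_wait` equal to `min_recovery`.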

------
annnnd
Congrats! But how nobody analysed this bug for 8+ years is a bit of a mystery
to me...

~~~
JoeAltmaier
Because nobody cares about suspend-resume power mgmt. If it doesn't work,
curse it, pull it out and put it back in again, voila it works.

The people who really care about and study the spec, are those who have to
support fixed devices i.e. USB devices internal to an appliance. They
physically cannot be removed by the user. So suspend/resume has to work.

Embedded programmers have to deal with totally-broken drivers/specs all the
time. There are probably 100s of folks who knew about this and dealt with it
(bumped the timeout in their embedded kernel to match the devices they
support) and never said anything to anybody.

~~~
masklinn
In this case, the pain point was apparently a ChromeOS device:
[http://marc.info/?l=linux-usb&m=137714769606183&w=2](http://marc.info/?l=linux-usb&m=137714769606183&w=2)

> This bug has been reproduced under ChromeOS, which is very aggressive about
> USB power management. It enables auto-suspend for all internal USB devices
> (wifi and bluetooth), and the disconnects wreck havoc on those devices,
> causing the ChromeOS GUIs to repeatedly flash the USB wifi setup screen on
> user login.

------
oakwhiz
This is a very interesting type of bug that I have often seen cropping up
around hardware interfaces in microcontrollers.

------
ape4
It's a good thing that Linux is open and transparent. Good to admit a bug (and
exactly what it is) rather than silently deny it and then possibly fix it.

Also, somebody uses Google+ ?

~~~
davidw
For whatever reason, there seems to be a number of Linux people on Google
Plus, including Linus Torvalds.

~~~
archivator
And Android people, and various YouTube people, and Internet celebrities and
...

That said, G+ is not half bad. The app beats Facebook hands down. Live
Hangouts are also a neat way to engage with your audience.

------
foobarqux
What are the conditions where this problem manifests?

I have a Das Keyboard that sporadically becomes unresponsive until I unplug and
plug it back in. How do I know if my problem is caused by the issue described
in the article?

~~~
blaenk
For what it's worth, I too have a Das Keyboard (Ultimate) and I don't
experience this problem (Arch 64-bit).

Hopefully that helps narrow down your issue.

~~~
foobarqux
Thanks. I'm on 32-bit Ubuntu 13.04, but the problem has presented itself for at
least the last several releases.

------
miga
Good that we have a fix for a bug that has been pestering me for quite a long
time. As for the maximum timeout, I believe that setting a maximum timeout in
sysfs with a default of 1s should satisfy most people (unless anyone wants a
per-device max wait time?)

~~~
Fuxy
50ms should be quite enough, I think. That's 5x the minimum, more than any
proper device should ask for. If you want to be extreme you can make it 100ms,
but any more than that is way too extreme.

~~~
Someone
_"50ms should be quite enough, I think"_

If you are going to make a statement like that, make it "64ms ought to be
enough for anyone". Either way, it won't help. If you write kernel code or
interface with unknown hardware, you must be paranoid to the bone to get
robust code. Doubly so if your kernel code talks with hardware you do not
control.

_"more than any proper device should ask for."_

If devices asked for time, things would be easy; you either reply 'no', or you
give them the time they ask for. The problem is that they take time without
telling you.

~~~
Fuxy
Well, given that 90% of hardware out there is supported just fine with the
current setting, I suspect 50ms would qualify as quite paranoid.

------
codex
This is alarming. If the issue really was that simple, it strongly indicates
that Linux kernel developers don't put a lot of effort into investigating
problems whenever a convenient scapegoat--faulty hardware--is available. For
shame.

------
T3RMINATED
You are probably going to get the middle finger from Linus Torvalds, and he will
say it was built like this by design and you're wrong.

------
Jugurtha
Well, that's why you don't hardcode a magic value, nor do you continuously
poll the state of a device; you rely on interrupts instead. That's what they're
made for.
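
As a generic illustration of the polling-versus-notification distinction (an analogy only, not a claim about how the USB stack signals that a resume has completed), the sketch below blocks on an event with a deadline instead of checking state in a loop. The `simulate_device` helper and all delays are hypothetical.

```python
import threading


def wait_for_ready(ready: threading.Event, max_wait: float) -> bool:
    """Block until `ready` is set or `max_wait` seconds pass; no busy polling."""
    return ready.wait(timeout=max_wait)


def simulate_device(ready: threading.Event, resume_delay: float) -> threading.Timer:
    """Hypothetical device: signals readiness once, after `resume_delay` seconds."""
    timer = threading.Timer(resume_delay, ready.set)
    timer.daemon = True  # do not keep the process alive for the simulation
    timer.start()
    return timer


if __name__ == "__main__":
    ready = threading.Event()
    simulate_device(ready, resume_delay=0.03)
    print("resumed" if wait_for_ready(ready, max_wait=1.0) else "gave up")
```

The waiter still needs a deadline: notification removes the polling loop, but it does not answer the question of how long is too long.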

~~~
alexchamberlain
That's not entirely fair; standards are full of hard coded values.

~~~
masklinn
In fact the 7-14 table mentioned in TFAA is literally a table of magic values
to hard-code: it's a big table listing all the timings of section 7 and their
minimum and maximum values (either one of which may be absent)

------
rustynails
Now can someone fix the embarrassing network bug in Linux? You know, where if
you access a link to a network, or access an open networked path after 255
seconds or so ... you receive a network error. It still beggars belief that such
a fundamental aspect of Linux is broken ... When I show Linux to newbies and
this fault occurs (i.e. 100% of the time), I simply say "Linux isn't perfect
..." But inside, I cringe...

It occurs under all distros I've tried and it's been there for years. Even on
different computers with different hardware...

~~~
stinos
Not sure what you mean exactly, but one thing that has caused tons of trouble
here, with sometimes the sole solution being a restart (OK, our IT maintainer
might be doing something wrong, yet...), is the opposite: take a bunch of
workstations and a bunch of servers, put home directories and data on the
servers, then tie everything together using NFS shares. Run analysis and
whatnot on the data. Then make the server go down somehow and watch all the
workstations get completely locked up, seemingly never generating any kind of
timeout error and instead waiting endlessly on a dead connection.

~~~
rwg
That's kinda how NFS was designed to work. NFS comes from the dark ages of
networking, where transient network errors were very common, even on local
networks.

"man nfs" and "man mount.nfs" and search for "soft", "intr", "tcp", and
"timeo". It sounds like the solution to your problem is some combination of
those options.

~~~
stinos
thanks, will put some more effort into it if it ever happens again

