As an EE turned "software engineer" this bothers me, a lot.
I like the EE part of it, but I prefer things that change more easily and are more "playful" (not to mention today hardware is at the mercy of software, so you take the reference design and go with it)
But I've run into situations where I uncovered a HW bug (in the chip's reference board implementation, no less) that only manifested itself because of something specific in software (in the HDMI standard - or rather, things the standard inherited from the likes of VESA)
The software engineer sees ports/memory to be written to and doesn't know what happens behind that.
The hardware engineer sees the "chip" and its connections but doesn't realise the rabbit hole goes deeper: "ah, this is a simple USB device, only 8 pins". Now try communicating with it.
That is a VERY general statement; most software engineers who do hardware stuff also know what it takes to make it. You can't design a driver for a card without knowing everything about the card. And I'd say the same of hardware engineers. If you don't know how the software is going to run, how are you supposed to architect it? You can't make a piece of hardware without thinking about how the driver will work.
True, but those examples you cited are the minority.
Some software engineers working on drivers are distant from the hardware developers (especially in Linux), and even inside corporations there's a wall somewhere.
And of course, sometimes there's an abstraction between hardware and driver (usually through firmware), commonly relating to a standard, like USB storage, ATAPI, etc.
" You can't make a piece of hardware without thinking about how the driver will work."
Unfortunately I've had to work with some devices that put very hard requirements on the software (basically, response time), or you had to add extra hardware to deal with it. In the second revision this problem was "fixed" by increasing a certain buffer size.
So yeah, sometimes hardware engineers don't think about that comprehensively enough.
Though I agree that it's probably too general a statement, I'm more with raverbashing on this one: while I now mainly do software, I have a strong hardware background, and it's more often than not baffling to see the approach of software-only engineers when they have to code across the software/hardware border. Then again, maybe I've only met the exceptions. I also have no idea if or how these topics are covered in a typical software engineering education.
Well, back when I was an MIT undergraduate, one of the core CS classes handed us a breadboard and a bunch of 74XX TTL chips, and we needed to construct a general-register, stack-based computer from that. (We did get some PC boards that gave us an ALU and a UART that plugged into the breadboard, as well as the ability to program an EPROM to implement the firmware, but none of this "here's an Arduino or Raspberry Pi".)
Maybe there's so much complexity in the software stack that we can't start CS majors from the hardware level any more, but I can't help thinking we've lost something as a result. These days, there are Java programmers who get that "deer in headlights" look when confronted with terms such as "cache line miss".
> These days, there are Java programmers who get that "deer in headlights" look when confronted with terms such as "cache line miss".
Forget that, most of these folks can't reason about a program that doesn't have automatic garbage collection. Even if they have direct experience with C or similar, I have asked recent grads how they imagine reference counting or malloc/free works, and they very often start pulling out GC-influenced magical thinking about "the system" reclaiming things under the covers.
lol, I see what you mean there. As long as the person you're talking to has a true engineering mind, he/she will be happy to learn about the subject. But unfortunately there are also those who start looking at you with eyes begging you to stop the hardware mumbo-jumbo talk and go back to software only. I don't really consider them true engineers.
It's 6.004, and it looks like they are still teaching it, which I was pleased to discover. Students are no longer asked to carry around the suitcase-sized breadboard "nerd kit" and they aren't using 74XX TTL chips any more. Now it's all done using simulators.
The name of the course is "Computation Structures", or, as MIT students would know it (since nearly everything at MIT is numbered, including buildings and departments), 6.004:
> You can't design a driver for a card without knowing everything about the card
Strongly disagree with that statement, though I sincerely wish it were true. My company manufactures hardware and does not provide a reference driver for any OS. We provide binary blobs and textual "guidelines".
For our hardware, driver authors operate without knowing any details beyond the interface.
So... they're actually not writing drivers for the hardware, but for the interface provided by your blob. I'm not sure I agree with the original statement, but I don't think your argument holds entirely.
It is always a reason to celebrate when one engineer successfully communicates with another of a different specialty. Big kudos to Intel for actually encouraging them to do so!
Wait, if I'm reading this correctly there's no safe resume recovery time which can be guaranteed not to cause devices to drop off the bus. The kernel could wait 10 minutes and devices could still require more than that. That seems like a pretty major issue with the USB specification.
If you issue a database query, you have no particular guarantee that it's going to complete in any finite amount of time. At some point, you simply throw up your hands and say it would be unreasonable to wait any longer, and accept the resulting error condition.
No, the spec is saying don't issue a request at all for "at least" 10ms. "At least" because you might have to wait longer, who knows how long, until it's safe to issue a request without triggering a disconnect.
I don't understand that, if devices can take as long as they please then why give the 10ms delay at all? It'd be like telling the hardware engineers "you're not allowed to wake up your device for at least 10ms". Is that just the spec trying to say "this should cover most use cases" and it turns out that's not really the case in practice? Or maybe it's an expected limitation of the technology that it would be nearly impossible to ready a device within 10ms, so the spec is saying it would be pointless to issue a request before that time?
I imagine it's something along the lines of "ok device, you have 10ms to play with the link before anyone's watching, do what you need to do." For example, electrically detecting the wattage of the power supplied over the link by examining its responses to various inputs. (I have no idea whether that's at all relevant here or if it's even the right part of the stack, I'm just offering an idea of why the device might like the controller to say "I'm not watching right now".)
acchow is disagreeing with your database query analogy. Your comment was effectively saying that it's just waiting until it's ready. And this is clearly something that can be disagreed with.
Based on the way this article is worded, it seems like there is no way to check when TRSMRCY is over. Imagine if you were waiting for a database query and if the database wasn't done thinking yet, simply accessing the socket would make the query abort.
The hub knows when the device is ready; just query it. A constant timeout is not needed. A give-up timeout might be employed, but there's no reason that can't be 100's of ms, nobody is waiting on that and it doesn't usually happen anyway.
I can read the website perfectly fine. It may not be shiny like all the iCrap, but it doesn't need to attract "average" users (e.g. spoiled, rich 12-year olds).
I would rather kernel hackers did something useful, instead of wasting time and money making every LKML archive look aesthetically beautiful.
Definitely yes. Them or others, whatever. And not for every issue. But I want to hear about issues that have been worrisome for a long time and/or annoyed a lot of people, or that are very complex and shady. And the only way to encourage communication about bugs is to congratulate people for their fixes/problem isolation.
Yes. As an industry, we're not very good at recognizing the great things people do. We tend to focus solely on mistakes. It's not easy to admit failure, and if we want it to continue we should recognize the effort it takes.
There is no "maximum" for a reason: it should be read as "hey hardware developer, you are guaranteed 10 ms from the System Software to resume". If you don't wake up in 10ms, you are clearly violating the spec.
9.2.6.2 states:
After a port is reset or resumed, the USB System Software is expected to provide a “recovery” interval of 10 ms before the device attached to the port is expected to respond to data transfers. The device may ignore any data transfers during the recovery interval.
After the end of the recovery interval (measured from the end of the reset or the end of the EOP at the end of the resume signaling), the device must accept data transfers at any time.
Nothing there says that the hardware must be ready at or after 10ms. It simply says that software can't ask for anything before 10ms is up. Software has to wait 10ms, and then might have to wait longer.
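To make the two readings concrete, here's a rough sketch (not the actual kernel code; port_resume_done() and the 1 s cap are hypothetical, made up purely for illustration):

    /*
     * Illustration only, not the real fix: port_resume_done() is a
     * hypothetical readiness check and the 1 s cap is an arbitrary choice.
     */
    #include <linux/delay.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    #define TRSMRCY_MS         10    /* spec minimum recovery interval */
    #define RESUME_WAIT_CAP_MS 1000  /* arbitrary upper bound for giving up */

    extern bool port_resume_done(void);   /* hypothetical: "is the port fully back?" */

    /* Fragile reading: "the recovery interval is exactly 10 ms". */
    static void resume_exact(void)
    {
            msleep(TRSMRCY_MS);
            /* first transfer goes out now; a slow device may drop off the bus */
    }

    /* "At least 10 ms" reading: wait the minimum, then keep checking for a while. */
    static int resume_at_least(void)
    {
            unsigned int waited;

            for (waited = 0; waited < RESUME_WAIT_CAP_MS; waited += TRSMRCY_MS) {
                    msleep(TRSMRCY_MS);
                    if (port_resume_done())
                            return 0;
            }
            return -ETIMEDOUT;
    }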
Yep, the mentioned 7-14 table makes this very clear: it's a big-ass table of timing names with a column for minimum and a column for maximum (plus a bunch of other columns for e.g. the timing unit), and for many timings only one of the two is filled in.
In that table, TRSMRCY has a minimum value (of 10ms) but no maximum.
Bear in mind that here, something which is a minimum from the point of view of the OS is a maximum from the device's point of view. If the OS is allowed to use any value above 10ms for TRSMRCY then the device can take at most 10ms to prepare itself because the OS can send a request at any point after that.
That interpretation doesn't seem to make much sense though, because there would be no point in specifying the 10ms at all - it means nothing, and doesn't put any restrictions on anyone.
An alternative interpretation is that by forbidding the host from issuing any commands for that initial period, the target device could (if it happens to be expedient) do all sorts of otherwise spec-violating things, safe in the knowledge that nobody will ever find out[2]. After that grace period, it has to go back to playing by the rules, but doesn't necessarily have to be operational. Whether or not it's operational yet can be queried[1] (after the first timeout), and a decision then made to use it, retry the query after a fixed or variable additional delay, or mark it as failed.
A better spec might recommend or mandate values for fallback quanta and repetitions, or a maximum bound on the delay, rather than leaving it vendor-gets-to-choose.
I didn't look too hard, but a decently marked timing diagram would be nice, and might make it easier to spot the unbounded nature of it, rather than having to cross-reference the inline '10ms' value with the minmax table elsewhere.
[1] At least, if I understand it correctly. If accessing the status info uses the same mechanism as general traffic, it's obviously subject to the flaw described, and this interpretation is wrong.
It does put a restriction on the software author: don't expect the hardware to be ready in less than 10ms. Perhaps the specification is somewhat nonsensical, but that does not permit the software author to invent more restrictive interpretations. IMHO, it tells the software author that their code should allow some flexibility.
But under your interpretation, the software can't expect the hardware to be ready in less than 100ms, 10s or 10 minutes either - so the value "10ms" is meaningless.
Indeed. And the software should continue to wait for readiness or decide when to give up. Giving up at the 10ms mark isn't against spec, it's just not prudent.
"After the end of the recovery interval the device must accept data transfers at any time." simply says that "hardware must be ready (in order to accept data transfer)"
But the "recovery interval" is not defined, leaving the device to decide what its "recovery interval" is and guaranteeing that software will not expect it to be less than 10ms.
USB System Software is expected to provide a “recovery” interval of 10 ms before the device attached to the port is expected to respond to data transfers
Is there anything in the spec suggesting that the hardware can take longer than 10ms? Given the phrasing in the spec ...
> 7.1.7.7 Resume
> The USB System Software must provide a [minimum] 10 ms resume recovery time (TRSMRCY) during which it will not attempt to access any device connected to the affected (just-activated) bus segment.
> 9.2.6.2 Reset/Resume Recovery Time
> After a port is reset or resumed, the USB System Software is expected to provide a “recovery” interval of [at least] 10 ms before the device attached to the port is expected to respond to data transfers.
... I would say that thinking the hardware can safely take more than 10 ms seems like a naive interpretation. You may note that calls like usleep() sleep for "at least" the requested time; there's no upper bound. The spec simply reflects this fairly typical aspect of software.
The true intention of the spec is academic at this point. There are millions upon millions of devices out there with one interpretation and they're not changing. Linux can either increase the grace period or be tarnished as having bad USB suspend.
Well the Linux USB maintainer has spent the last month or so trying to get Linus to be more polite, so I guess those kind of things have a higher priority!
I kid, I kid...
The reason is that it is incredibly difficult to link the disconnect to the cause, as the 10ms is likely sufficient in 99% of cases - until it suddenly isn't. This means that you could be running test cases on a certain device for a year, and suddenly the test will fail the day after. When the test case mysteriously fails randomly like that on only a subset of devices, the assumption is that the hardware is faulty. These kinds of failures would likely be more frequent on lower quality, less optimized hardware as well, furthering the perception.
As far as I can tell, the reason this is fixed now is because known good hardware from Intel started exhibiting the same error which got people at Intel to track it down directly, as they knew it wasn't their hardware at fault.
>Out of 227 remote wakeup events from a USB mouse and keyboard:
> - 163 transitions from RExit to U0 were immediate ( < 1 microsecond)
> - 47 transitions from RExit to U0 took under 10 ms
> - 17 transitions were over 10ms
So, 10 ms might indeed be sufficient for 99% of devices. But some devices (like this mouse/kb combo) need between 10 ms and 12 ms in 8% of all wakeups.
Because nobody cares about suspend-resume power mgmt. If it doesn't work, curse it, pull it out and put it back in again, voila it works.
The people who really care about and study the spec are those who have to support fixed devices, i.e. USB devices internal to an appliance. They physically cannot be removed by the user, so suspend/resume has to work.
Embedded programmers have to deal with totally-broken drivers/specs all the time. There are probably 100s of folks who knew about this and dealt with it (bumped the timeout in their embedded kernel to match the devices they support) and never said anything to anybody.
> This bug has been reproduced under ChromeOS, which is very aggressive about USB power management. It enables auto-suspend for all internal USB devices (wifi and bluetooth), and the disconnects wreck havoc on those devices, causing the ChromeOS GUIs to repeatedly flash the USB wifi setup screen on user login.
Also with sound devices it is more than a tad annoying to have to unplug and replug a USB external audio interface (powering down your monitors, etc.) because it magically disappeared for some reason.
This is an amazing fix if it is the root of the sorts of problems I've seen on Linux (which've kept me crawling back to Mac for hardware support)
When cheap hardware acts like it doesn't follow the spec, no one digs too deep, because it's always going to be quite frequent, and there's nothing you can do about it. It's very rare that it turns out to actually have been following the spec, and you had the spec wrong. That's the practical reason.
I can't speak for kernel developers, but when you have complex and large codebase running on a huge variety of hardware, you will have some edge cases that are rare or difficult to debug. And I don't envy the folks that have to interface directly with hardware, I have enough fun in database land...
Why is that variable set at 10? Who would question that?
The spec says 10 too. It's the "at least 10" part that was missed. That's very subtle, does not stand out, and is easily overlooked unless someone is really auditing code and reading specs carefully.
What are the conditions where this problem manifests?
I have a Das Keyboard that sporadically becomes unresponsive until I unplug it and plug it back in. How do I know if my problem is caused by the issue described in the article?
Good that we have a fix for a bug that has been pestering me for quite a long time. As for the maximum timeout, I believe that a maximum timeout settable via sysfs, with a default of 1s, should satisfy most people (unless anyone wants a per-device max wait time?)
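Something like a module parameter would probably do. Sketch only: the name resume_recovery_max_ms and the 1 s default are made up here, this is not an existing kernel option:

    #include <linux/module.h>

    /* Invented name/default, to illustrate the kind of knob meant above. */
    static unsigned int resume_recovery_max_ms = 1000;   /* default: 1 s */
    module_param(resume_recovery_max_ms, uint, 0644);    /* shows up under /sys/module/.../parameters/ */
    MODULE_PARM_DESC(resume_recovery_max_ms,
                     "Longest time (ms) to keep waiting for a device after resume before giving up");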
50ms should be quite enough, I think. That's 5x the minimum, more than any proper device should ask for.
If you want to be extreme you can make it 100ms, but any more than that is way too extreme.
If you are going to make a statement like that, make it "64ms ought to be enough for anyone". Either way, it won't help. If you write kernel code or interface with unknown hardware, you must be paranoid to the bone to get robust code. Doubly so if your kernel code talks to hardware you do not control.
"more than any proper device should ask for."
If devices asked for time, things would be easy; you either reply 'no', or you give them the time they ask for. The problem is that they take time without telling you.
This is alarming. If the issue really was that simple, it strongly indicates that Linux kernel developers don't put a lot of effort into investigating problems whenever a convenient scapegoat--faulty hardware--is available. For shame.
Well, that's why you don't hardcode a magic value, nor continuously poll the state of a device; you rely on interrupts instead: that's what they're made for.
> Well, that's why you don't hardcode a magic value, nor continuously poll the state of a device; you rely on interrupts instead: that's what they're made for.
There was a mention about this in the OP. There were no interrupts for this state transition in USB prior to USB3. "The Intel xHCI host, unlike the EHCI host, actually gives an interrupt when the port fully transitions to the active state."
In addition, a lot of hardware initialization is based on delays and polling by design.
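The usual shape of that is roughly the following (generic sketch; the register offset and ready bit are invented, not from any real device):

    /*
     * Generic "delay, then poll until ready or deadline" sketch.
     * REG_STATUS and STATUS_READY are invented for illustration.
     */
    #include <linux/delay.h>
    #include <linux/errno.h>
    #include <linux/io.h>
    #include <linux/jiffies.h>

    #define REG_STATUS   0x04   /* hypothetical status register offset */
    #define STATUS_READY 0x01   /* hypothetical "device ready" bit */

    static int wait_for_ready(void __iomem *base, unsigned int timeout_ms)
    {
            unsigned long deadline = jiffies + msecs_to_jiffies(timeout_ms);

            while (!(readl(base + REG_STATUS) & STATUS_READY)) {
                    if (time_after(jiffies, deadline))
                            return -ETIMEDOUT;
                    usleep_range(1000, 2000);   /* sleep between polls, don't busy-spin */
            }
            return 0;
    }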
I'm not familiar with the USB spec. I happened to do tear-downs of flash drives to find out which microcontroller is in there and try to dump stuff, etc... (For anyone interested, look up the VID:PID and Google the microcontroller chip. It's a fascinating world where Russians are very present.)
Your second comment, about a lot of hw initialization being based on polling, raises an "a fortiori" question: why cascade this necessary evil into software? It's true that in the beginning, writing code for microcontrollers, you avoid working with interrupts. You work in assembly language, get a bit lazy, and hard-code delays (and even then, I'd choose a longer delay than the one specified as "max" in the datasheet: it's not like a chip performing worse than its datasheet has never happened, so I take my precautions).
But then battery life reminds you of your bad coding practice: this continuous polling is draining. (And so you put the uC to sleep :) )
Even more, if the application involves sensors there's no way around using interrupts. Unless the thing is powered from a wall socket, but even then, your conscience keeps you from sleeping at night, thinking about that horrible, barbarian code you put in there.
But then again, I don't know the USB spec and this may change so it's cool. And they have come a long way.
In fact the 7-14 table mentioned in TFAA is literally a table of magic values to hard-code: it's a big table listing all the timings of section 7 and their minimum and maximum values (either one of which may be absent)
You're forgetting the feeling of smug virtuousness you get when you end up being incompatible because you're more technically correct (the best kind) than the other components you're interacting with.
You get to say the other guys are all wrong, wage wars against them, blacklists, all the usual religious crap etc.
Now can someone fix the embarrassing network bug in Linux? You know, where if you access a link to a network, or access an open networked path after 255 seconds or so... you receive a network error. It still beggars belief that such a fundamental aspect of Linux is broken ...
When I show Linux to newbies and this fault occurs (i.e. 100% of the time), I simply say "Linux isn't perfect ..." But inside, I cringe...
It occurs under all distros I've tried and it's been there for years. Even on different computers with different hardware...
I have TCP connections that have stayed open for months so this is highly unlikely to be a kernel issue. I have no idea what you might be running into as I've never seen anything like that occurring.
If you are connecting to a remote system, it could be NAT configured badly on your router.
The router provided by my ISP (Virgin Media) is ruthless at closing idle TCP connections after only a few minutes. I'd see this with idle SSH logins being closed all the time.
The solution (for me at least) was to ensure connections used TCP keepalives, and vastly decrease the keepalive times (various sysctl calls, I don't have the details to hand).
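For reference, the per-socket version looks roughly like this (the numbers are just examples; the system-wide knobs are the net.ipv4.tcp_keepalive_* sysctls):

    /* Per-socket TCP keepalive tuning on Linux; the values are examples only. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int enable_keepalive(int fd)
    {
            int on = 1, idle = 120, intvl = 30, cnt = 5;

            if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
                    return -1;
            /* Start probing after 120 s idle, probe every 30 s, give up after 5 misses. */
            if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
                setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0 ||
                setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0)
                    return -1;
            return 0;
    }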
But it instead required a bunch of moderately obscure changes to system config which you are bound to forget after a few years when you reinstall or image a new machine. The Virgin Superhub is a crappy, barely consumer-grade box from Netgear with firmware written by Virgin. Modem mode is all it's good for. Sometimes the right answer is to spend the money on a decent router - Draytek are passable.
Never had such an error. Even on Linux 0.99pl10 (in 1992), the network worked fine. I have always made heavy use of remote access to X servers. If the network were broken, Linux would have been unusable.
backing the previous poster here, I've never experienced anything like this at all. "Network error" makes it sound like you're using KDE or GNOME or something, or Samba isn't liking your configuration...
Have you eliminated all other variables, e.g. a common utility/setting, cabling, switch/router, etc.? I've had paths open for much longer than that without access issues.
Not sure what you mean exactly, but one thing that has caused tons of trouble here, with sometimes the sole solution being a restart (ok, our IT maintainer might be doing something wrong, yet..), is the opposite: take a bunch of workstations and a bunch of servers, put home directories and data on the servers, then tie everything together using NFS shares. Run analysis and whatnot on the data. Then make a server go down somehow and watch all the workstations get completely locked up, seemingly without ever generating any kind of timeout error, instead waiting endlessly on a dead connection.
That's kinda how NFS was designed to work. NFS comes from the dark ages of networking, where transient network errors were very common, even on local networks.
"man nfs" and "man mount.nfs" and search for "soft", "intr", "tcp", and "timeo". It sounds like the solution to your problem is some combination of those options.
Props to Intel for hiring leading Linux developers and turning them loose.