As an EE turned "software engineer" this bothers me, a lot.
I like the EE part of it, but I prefer things that change more easily and are more "playful" (not to mention today hardware is at the mercy of software, so you take the reference design and go with it)
But I've run into situations where I uncovered a HW bug (in the chip's reference board implementation, no less) that only manifested itself because of something specific in software (in the HDMI standard - or rather, things the standard inherited from the likes of VESA)
The software engineer sees ports/memory to be written to and doesn't know what happens behind that.
The hardware engineer sees the "chip" and its connections but doesn't realise the rabbit hole goes deeper: "ah, this is a simple USB device, only 8 pins". Now try communicating with it.
That is a VERY general statement; most software engineers who do hardware stuff also know what it takes to make it. You can't design a driver for a card without knowing everything about the card. And I'd say the same of hardware engineers. If you don't know how the software is going to run, how are you supposed to architect it? You can't make a piece of hardware without thinking about how the driver will work.
True, but those examples you cited are the minority.
Some software engineers working on drivers are distant from the hardware developers (especially in Linux), and even inside corporations there's a wall somewhere.
And of course, sometimes there's an abstraction between hardware and driver (usually through firmware), commonly relating to a standard, like USB storage, ATAPI, etc.
" You can't make a piece of hardware without thinking about how the driver will work."
Unfortunately I've had to work with some devices that put very hard requirements on the software (basically, response time), or you had to add extra hardware to deal with it. In the second revision this problem was "fixed" by increasing a certain buffer size.
So yeah, sometimes hardware engineers don't think about that comprehensively enough.
Though I agree that it's probably too general a statement, I'm more with raverbashing on this one: while I now mainly do software, I have a strong hardware background, and it's more often than not baffling to see the approach of software-only engineers when they have to code across the software/hardware border. Then again, maybe I've only met the exceptions. I also have no idea if or how these topics are covered in a typical software engineering education.
Well, back when I was an MIT undergraduate, one of the core CS classes handed us a breadboard and a bunch of 74XX TTL chips, and we needed to construct a general-register, stack-based computer from that. (We did get some PC boards that gave us an ALU and a UART that plugged into the breadboard, as well as the ability to program an EPROM to implement the firmware, but none of this "here's an Arduino or Raspberry Pi".)
Maybe there's so much complexity in the software stack that we can't start CS majors from the hardware level any more, but I can't help thinking we've lost something as a result. These days, there are Java programmers who get that "deer in headlights" look when confronted with terms such as "cache line miss".
> These days, there are Java programmers who get that "deer in headlights" look when confronted with terms such as "cache line miss".
Forget that, most of these folks can't reason about a program that doesn't have automatic garbage collection. Even if they have direct experience with C or similar, I have asked recent grads how they imagine reference counting or malloc/free works, and they very often start pulling out GC-influenced magical thinking about "the system" reclaiming things under the covers.
lol, I see what you mean there. As long as the person you're talking to has a true engineering mind, he/she will be happy to learn about the subject. But unfortunately there are also those who start looking at you with eyes begging you to stop the hardware mumbo-jumbo talk and go back to software only. I don't really consider them true engineers.
It's 6.004, and it looks like they are still teaching it, which I was pleased to discover. Students are no longer asked to carry around the suitcase-sized breadboard "nerd kit" and they aren't using 74XX TTL chips any more. Now it's all done using simulators.
The name of the course is "Computation Structures", or, as MIT students would know it (since nearly everything at MIT is numbered, including buildings and departments), 6.004:
> You can't design a driver for a card without knowing everything about the card
Strongly disagree with that statement, though I sincerely wish it were true. My company manufactures hardware and does not provide a reference driver for any OS. We provide binary blobs and textual "guidelines".
For our hardware, driver authors operate without knowing any details beyond the interface.
So... they're actually not writing drivers for the hardware, but for the interface provided by your blob. I'm not sure I agree with the original statement, but I don't think your argument holds entirely.
It is always a reason to celebrate when one engineer successfully communicates with another of a different specialty. Big kudos to Intel for actually encouraging them to do so!
Wait, if I'm reading this correctly there's no safe resume recovery time which can be guaranteed not to cause devices to drop off the bus. The kernel could wait 10 minutes and devices could still require more than that. That seems like a pretty major issue with the USB specification.
If you issue a database query, you have no particular guarantee that it's going to complete in any finite amount of time. At some point, you simply throw up your hands and say it would be unreasonable to wait any longer, and accept the resulting error condition.
No, the spec is saying don't issue a request at all for "at least" 10ms. "At least" because you might have to wait longer, who knows how long, until it's safe to issue a request without triggering a disconnect.
I don't understand that, if devices can take as long as they please then why give the 10ms delay at all? It'd be like telling the hardware engineers "you're not allowed to wake up your device for at least 10ms". Is that just the spec trying to say "this should cover most use cases" and it turns out that's not really the case in practice? Or maybe it's an expected limitation of the technology that it would be nearly impossible to ready a device within 10ms, so the spec is saying it would be pointless to issue a request before that time?
I imagine it's something along the lines of "ok device, you have 10ms to play with the link before anyone's watching, do what you need to do." For example, electrically detecting the wattage of the power supplied over the link by examining its responses to various inputs. (I have no idea whether that's at all relevant here or if it's even the right part of the stack, I'm just offering an idea of why the device might like the controller to say "I'm not watching right now".)
acchow is disagreeing with your database query analogy. Your comment was effectively saying that it's just waiting until it's ready. And this is clearly something that can be disagreed with.
Based on the way this article is worded, it seems like there is no way to check when TRSMRCY is over. Imagine if you were waiting for a database query and if the database wasn't done thinking yet, simply accessing the socket would make the query abort.
The hub knows when the device is ready; just query it. A constant timeout is not needed. A give-up timeout might be employed, but there's no reason that can't be 100's of ms, nobody is waiting on that and it doesn't usually happen anyway.
I can read the website perfectly fine. It may not be shiny like all the iCrap, but it doesn't need to attract "average" users (e.g. spoiled, rich 12-year olds).
I would rather kernel hackers did something useful, instead of wasting time and money making every LKML archive look aesthetically beautiful.
Definitely yes. Them or others, whatever. And not for every issue. But I want to hear about issues that have been worrisome for a long time and/or annoyed a lot of people, or that are very complex and shady. And the only way to encourage communication about bugs is to congratulate people for their fixes/problem isolation.
Yes. As an industry, we're not very good at recognizing the great things people do. We tend to focus solely on mistakes. It's not easy to admit failure, and if we want it to continue we should recognize the effort it takes.
There is no "maximum" for a reason: it should be read as "hey hardware developer, you are guaranteed 10 ms from the System Software to resume". If you don't wake up in 10ms, you are clearly violating the spec.
9.2.6.2 states:
After a port is reset or resumed, the USB System Software is expected to provide a “recovery” interval of 10 ms before the device attached to the port is expected to respond to data transfers. The device may ignore any data transfers during the recovery interval.
After the end of the recovery interval (measured from the end of the reset or the end of the EOP at the end of the resume signaling), the device must accept data transfers at any time.
Nothing there says that the hardware must be ready at or after 10ms. It simply says that software can't ask for anything before 10ms is up. Software has to wait 10ms, and then might have to wait longer.
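To make the two readings concrete, here's a rough sketch (not the actual kernel code; port_resume_done() and the 1 s cap are hypothetical, made up purely for illustration):

    /*
     * Illustration only, not the real fix: port_resume_done() is a
     * hypothetical readiness check and the 1 s cap is an arbitrary choice.
     */
    #include <linux/delay.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    #define TRSMRCY_MS         10    /* spec minimum recovery interval */
    #define RESUME_WAIT_CAP_MS 1000  /* arbitrary upper bound for giving up */

    extern bool port_resume_done(void);   /* hypothetical: "is the port fully back?" */

    /* Fragile reading: "the recovery interval is exactly 10 ms". */
    static void resume_exact(void)
    {
            msleep(TRSMRCY_MS);
            /* first transfer goes out now; a slow device may drop off the bus */
    }

    /* "At least 10 ms" reading: wait the minimum, then keep checking for a while. */
    static int resume_at_least(void)
    {
            unsigned int waited;

            for (waited = 0; waited < RESUME_WAIT_CAP_MS; waited += TRSMRCY_MS) {
                    msleep(TRSMRCY_MS);
                    if (port_resume_done())
                            return 0;
            }
            return -ETIMEDOUT;
    }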
Yep, the mentioned 7-14 table makes this very clear: it's a big-ass table of timing names with a column for minimum and a column for maximum (plus a bunch of other columns for e.g. the timing unit), and for many timings only one of the two is filled in.
In that table, TRSMRCY has a minimum value (of 10ms) but no maximum.
Bear in mind that here, something which is a minimum from the point of view of the OS is a maximum from the device's point of view. If the OS is allowed to use any value above 10ms for TRSMRCY then the device can take at most 10ms to prepare itself because the OS can send a request at any point after that.
That interpretation doesn't seem to make much sense though, because there would be no point in specifying the 10ms at all - it means nothing, and doesn't put any restrictions on anyone.
An alternative interpretation is that by forbidding the host from issuing any commands for that initial period, the target device could (if it happens to be expedient) do all sorts of otherwise spec-violating things, safe in the knowledge that nobody will ever find out[2]. After that grace period, it has to go back to playing by the rules, but doesn't necessarily have to be operational. Whether or not it's operational yet can be queried[1] (after the first timeout), and a decision then made to use it, retry the query after a fixed or variable additional delay, or mark it as failed.
A better spec might recommend or mandate values for fallback quanta and repetitions, or a maximum bound on the delay, rather than leaving it vendor-gets-to-choose.
I didn't look too hard, but a decently marked timing diagram would be nice, and might make it easier to spot the unbounded nature of it, rather than having to cross-reference the inline '10ms' value with the minmax table elsewhere.
[1] At least, if I understand it correctly. If accessing the status info uses the same mechanism as general traffic, it's obviously subject to the flaw described, and this interpretation is wrong.
It does put a restriction on the software author: don't expect the hardware to be ready in less than 10ms. Perhaps the specification is somewhat nonsensical, but that does not permit the software author to invent more restrictive interpretations. IMHO, it tells the software author that their code should allow some flexibility.
But under your interpretation, the software can't expect the hardware to be ready in less than 100ms, 10s or 10 minutes either - so the value "10ms" is meaningless.
Indeed. And the software should continue to wait for readiness or decide when to give up. Giving up at the 10ms mark isn't against spec, it's just not prudent.
"After the end of the recovery interval the device must accept data transfers at any time." simply says that "hardware must be ready (in order to accept data transfer)"
But the "recovery interval" is not defined, leaving the device to decide what its "recovery interval" is and guaranteeing that software will not expect it to be less than 10ms.
USB System Software is expected to provide a “recovery” interval of 10 ms before the device attached to the port is expected to respond to data transfers
Is there anything in the spec suggesting that the hardware can take longer than 10ms? Given the phrasing in the spec ...
> 7.1.7.7 Resume
> The USB System Software must provide a [minimum] 10 ms resume recovery time (TRSMRCY) during which it will not attempt to access any device connected to the affected (just-activated) bus segment.
> 9.2.6.2 Reset/Resume Recovery Time
> After a port is reset or resumed, the USB System Software is expected to provide a “recovery” interval of [at least] 10 ms before the device attached to the port is expected to respond to data transfers.
... I would say that thinking the hardware can safely take more than 10 ms seems like a naive interpretation. You may note that calls like usleep() sleep for "at least" the requested time; there's no upper bound. The spec simply reflects this fairly typical aspect of software.
The true intention of the spec is academic at this point. There are millions upon millions of devices out there with one interpretation and they're not changing. Linux can either increase the grace period or be tarnished as having bad USB suspend.
Well the Linux USB maintainer has spent the last month or so trying to get Linus to be more polite, so I guess those kind of things have a higher priority!
I kid, I kid...
The reason is that it is incredibly difficult to link the disconnect to the cause, as the 10ms is likely sufficient in 99% of cases - until it suddenly isn't. This means that you could be running test cases on a certain device for a year, and suddenly the test will fail the day after. When the test case mysteriously fails randomly like that on only a subset of devices, the assumption is that the hardware is faulty. These kinds of failures would likely be more frequent on lower quality, less optimized hardware as well, furthering the perception.
As far as I can tell, the reason this is fixed now is because known good hardware from Intel started exhibiting the same error which got people at Intel to track it down directly, as they knew it wasn't their hardware at fault.
>Out of 227 remote wakeup events from a USB mouse and keyboard:
> - 163 transitions from RExit to U0 were immediate ( < 1 microsecond)
> - 47 transitions from RExit to U0 took under 10 ms
> - 17 transitions were over 10ms
So, 10 ms might indeed be sufficient for 99% of devices. But some devices (like this mouse/kb combo) need between 10 ms and 12 ms in 8% of all wakeups.
Because nobody cares about suspend-resume power mgmt. If it doesn't work, curse it, pull it out and put it back in again, voila it works.
The people who really care about and study the spec are those who have to support fixed devices, i.e. USB devices internal to an appliance. They physically cannot be removed by the user, so suspend/resume has to work.
Embedded programmers have to deal with totally-broken drivers/specs all the time. There are probably 100s of folks who knew about this and dealt with it (bumped the timeout in their embedded kernel to match the devices they support) and never said anything to anybody.
> This bug has been reproduced under ChromeOS, which is very aggressive about USB power management. It enables auto-suspend for all internal USB devices (wifi and bluetooth), and the disconnects wreck havoc on those devices, causing the ChromeOS GUIs to repeatedly flash the USB wifi setup screen on user login.
Also with sound devices it is more than a tad annoying to have to unplug and replug a USB external audio interface (powering down your monitors, etc.) because it magically disappeared for some reason.
This is an amazing fix if it is the root of the sorts of problems I've seen on Linux (which've kept me crawling back to Mac for hardware support)
When cheap hardware acts like it doesn't follow the spec, no one digs too deep, because it's always going to be quite frequent, and there's nothing you can do about it. It's very rare that it turns out to actually have been following the spec, and you had the spec wrong. That's the practical reason.
I can't speak for kernel developers, but when you have complex and large codebase running on a huge variety of hardware, you will have some edge cases that are rare or difficult to debug. And I don't envy the folks that have to interface directly with hardware, I have enough fun in database land...
Why is that variable set at 10? Who would question that?
The spec says 10 too. It's the "at least 10" part that was missed. That's very subtle, does not stand out, and is easily overlooked unless someone is really auditing code and reading specs carefully.
What are the conditions where this problem manifests?
I have a Das Keyboard that sporadically becomes unresponsive until I unplug it and plug it back in. How do I know if my problem is caused by the issue described in the article?
Good that we have a fix for a bug that has been pestering me for quite a long time. As for the maximum timeout, I believe that a maximum timeout settable via sysfs, with a default of 1s, should satisfy most people (unless anyone wants a per-device max wait time?)
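Something like a module parameter would probably do. Sketch only: the name resume_recovery_max_ms and the 1 s default are made up here, this is not an existing kernel option:

    #include <linux/module.h>

    /* Invented name/default, to illustrate the kind of knob meant above. */
    static unsigned int resume_recovery_max_ms = 1000;   /* default: 1 s */
    module_param(resume_recovery_max_ms, uint, 0644);    /* shows up under /sys/module/.../parameters/ */
    MODULE_PARM_DESC(resume_recovery_max_ms,
                     "Longest time (ms) to keep waiting for a device after resume before giving up");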
50ms should be quite enough, I think. That's 5x the minimum, more than any proper device should ask for.
If you want to be extreme you can make it 100ms, but any more than that is way too extreme.
If you are going to make a statement like that, make it "64ms ought to be enough for anyone". Either way, it won't help. If you write kernel code or interface with unknown hardware, you must be paranoid to the bone to get robust code. Doubly so if your kernel code talks to hardware you do not control.
"more than any proper device should ask for."
If devices asked for time, things would be easy; you either reply 'no', or you give them the time they ask for. The problem is that they take time without telling you.
This is alarming. If the issue really was that simple, it strongly indicates that Linux kernel developers don't put a lot of effort into investigating problems whenever a convenient scapegoat--faulty hardware--is available. For shame.
Well, that's why you don't hardcode a magic value, nor continuously poll the state of a device; you rely on interrupts instead: that's what they're made for.
> Well, that's why you don't hardcode a magic value, nor continuously poll the state of a device; you rely on interrupts instead: that's what they're made for.
There was a mention about this in the OP. There were no interrupts for this state transition in USB prior to USB3. "The Intel xHCI host, unlike the EHCI host, actually gives an interrupt when the port fully transitions to the active state."
In addition, a lot of hardware initialization is based on delays and polling by design.
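The usual shape of that is roughly the following (generic sketch; the register offset and ready bit are invented, not from any real device):

    /*
     * Generic "delay, then poll until ready or deadline" sketch.
     * REG_STATUS and STATUS_READY are invented for illustration.
     */
    #include <linux/delay.h>
    #include <linux/errno.h>
    #include <linux/io.h>
    #include <linux/jiffies.h>

    #define REG_STATUS   0x04   /* hypothetical status register offset */
    #define STATUS_READY 0x01   /* hypothetical "device ready" bit */

    static int wait_for_ready(void __iomem *base, unsigned int timeout_ms)
    {
            unsigned long deadline = jiffies + msecs_to_jiffies(timeout_ms);

            while (!(readl(base + REG_STATUS) & STATUS_READY)) {
                    if (time_after(jiffies, deadline))
                            return -ETIMEDOUT;
                    usleep_range(1000, 2000);   /* sleep between polls, don't busy-spin */
            }
            return 0;
    }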
I'm not familiar with the USB spec. I happened to do tear-downs of flash drives to find out which microcontroller is in there and try to dump stuff, etc... (For anyone interested, look up the VID:PID and Google the microcontroller chip. It's a fascinating world where Russians are very present.)
Your second comment, about a lot of hw initialization being based on polling, raises an "a fortiori" question: why cascade this necessary evil into software? It's true that in the beginning, writing code for microcontrollers, you avoid working with interrupts. You work in assembly language, get a bit lazy, and hard-code delays (and even then, I'd choose a longer delay than the one specified as "max" in the datasheet: it's not like a chip performing worse than its datasheet has never happened, so I take my precautions).
But then battery life reminds you of your bad coding practice: this continuous polling is draining. (And so you put the uC to sleep :) )
Even more, if the application involves sensors there's no way around using interrupts. Unless the thing is powered from a wall socket, but even then, your conscience keeps you from sleeping at night, thinking about that horrible, barbarian code you put in there.
But then again, I don't know the USB spec and this may change so it's cool. And they have come a long way.
In fact the 7-14 table mentioned in TFAA is literally a table of magic values to hard-code: it's a big table listing all the timings of section 7 and their minimum and maximum values (either one of which may be absent)
You're forgetting the feeling of smug virtuousness you get when you end up being incompatible because you're more technically correct (the best kind) than the other components you're interacting with.
You get to say the other guys are all wrong, wage wars against them, blacklists, all the usual religious crap etc.
Now can someone fix the embarrassing network bug in Linux? You know, where if you access a link to a network, or access an open networked path after 255 seconds or so... you receive a network error. It still beggars belief that such a fundamental aspect of Linux is broken ...
When I show Linux to newbies and this fault occurs (i.e. 100% of the time), I simply say "Linux isn't perfect ..." But inside, I cringe...
It occurs under all distros I've tried and it's been there for years. Even on different computers with different hardware...
I have TCP connections that have stayed open for months so this is highly unlikely to be a kernel issue. I have no idea what you might be running into as I've never seen anything like that occurring.
If you are connecting to a remote system, it could be NAT configured badly on your router.
The router provided by my ISP (Virgin Media) is ruthless at closing idle TCP connections after only a few minutes. I'd see this with idle SSH logins being closed all the time.
The solution (for me at least) was to ensure connections used TCP keepalives, and vastly decrease the keepalive times (various sysctl calls, I don't have the details to hand).
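For reference, the per-socket version looks roughly like this (the numbers are just examples; the system-wide knobs are the net.ipv4.tcp_keepalive_* sysctls):

    /* Per-socket TCP keepalive tuning on Linux; the values are examples only. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int enable_keepalive(int fd)
    {
            int on = 1, idle = 120, intvl = 30, cnt = 5;

            if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
                    return -1;
            /* Start probing after 120 s idle, probe every 30 s, give up after 5 misses. */
            if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
                setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0 ||
                setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0)
                    return -1;
            return 0;
    }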
But it instead required a bunch of moderately obscure changes to system config which you are bound to forget after a few years when you reinstall or image a new machine. The Virgin Superhub is a crappy, barely consumer-grade box from Netgear with firmware written by Virgin. Modem mode is all it's good for. Sometimes the right answer is to spend the money on a decent router - Draytek are passable.
Never had such an error. Even on Linux 0.99pl10 (in 1992), the network worked fine. I have always made heavy use of remote access to X servers. If the network were broken, Linux would have been unusable.
backing the previous poster here, I've never experienced anything like this at all. "Network error" makes it sound like you're using KDE or GNOME or something, or Samba isn't liking your configuration...
Have you eliminated all other variables, e.g. a common utility/setting, cabling, switch/router, etc.? I've had paths open for much longer than that without access issues.
Not sure what you mean exactly, but one thing that has caused tons of trouble here, with sometimes the sole solution being a restart (ok, our IT maintainer might be doing something wrong, yet..), is the opposite: take a bunch of workstations and a bunch of servers, put home directories and data on the servers, then tie everything together using NFS shares. Run analysis and whatnot on the data. Then make a server go down somehow and watch all the workstations get completely locked up, seemingly without ever generating any kind of timeout error, instead waiting endlessly on a dead connection.
That's kinda how NFS was designed to work. NFS comes from the dark ages of networking, where transient network errors were very common, even on local networks.
"man nfs" and "man mount.nfs" and search for "soft", "intr", "tcp", and "timeo". It sounds like the solution to your problem is some combination of those options.
Props to Intel for hiring leading Linux developers and turning them loose.