But that's a far cry from the stack overflow actually causing any cases of unintended acceleration.
- the abhorrent state of the engine control module code,
- the RTOS's design,
- the critical data structures right above the stack,
- that those critical data structures weren't mirrored to detect corruption as is standard and as they did for other data,
- that single-bit changes in that critical data structure right above the stack can cause the death of tasks in the RTOS, that the failsafe capabilities lived in those same tasks, and that their death was tested and confirmed to cause unintended acceleration consistent with accounts and descriptions of the events,
- that the failsafe monitoring CPU was not designed to detect this failure, and in fact Toyota outsourced its design and didn't even have the source code to it...
Where was that demonstrated? It was hypothesized that it may be a result, but I'm not seeing anything other than speculation.
The article itself was basic and didn't really address the specific Toyota case, but the linked slides did... if you're going to speculate about the trial, look at Barr's slides, not that article.
The V850 has 1...256 kBytes of RAM. Let's assume 256 kBytes of RAM, and let's assume they have a stack of 16 kBytes. With 94% of the stack reported in use, there are only 983 bytes of free stack. A function call with 5 int parameters and 10 int variables needs around 64 bytes.
=> Only 16 recursive calls are needed to cause a stack overflow.
Further, because the critical data structures in the OS are not protected, the OS will have no idea that something is wrong.
(In several projects we had to keep our critical data twice in RAM, once normal and once inverted. In one critical project we even had to do all computations twice, once with the normal data and once with the inverted data. This way you will also find some bit flips in the ALU, RAM, etc.)
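For concreteness, the normal-plus-inverted scheme described above can be sketched in C like this (the names and the 16-bit width are illustrative, not from any real ECU codebase):

```c
#include <stdint.h>

/* Each critical variable is stored twice: once normal, once bitwise
 * inverted. The XOR of a valid pair is always 0xFFFF, so any single
 * bit flip in either copy is detected on read. */
typedef struct {
    uint16_t value;
    uint16_t value_inv;   /* always kept equal to ~value */
} redundant_u16;

static void red_write(redundant_u16 *r, uint16_t v) {
    r->value = v;
    r->value_inv = (uint16_t)~v;
}

/* Returns 0 and stores the value in *out, or -1 if corruption
 * was detected and the value must not be trusted. */
static int red_read(const redundant_u16 *r, uint16_t *out) {
    if ((uint16_t)(r->value ^ r->value_inv) != 0xFFFFu)
        return -1;
    *out = r->value;
    return 0;
}
```

On a detected mismatch, a real system would typically fall back to a failsafe value or trigger a reset rather than just returning an error code.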
In the slides I found that the stack size is 4 kBytes, so at 94% usage only 245 bytes are free! You can't make many recursive calls!
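The per-call frame estimate above is easy to sanity-check empirically. This sketch (host-PC C, not ECU code; the measured size is compiler- and target-dependent) approximates one frame's stack cost by comparing local-variable addresses across two recursion levels:

```c
#include <stdint.h>
#include <stddef.h>

/* Mimic the function discussed above: roughly 10 int locals per call.
 * The address of `locals` in two adjacent frames differs by one
 * frame's stack cost. */
static size_t probe(int depth, uintptr_t caller_locals) {
    volatile int locals[10];
    locals[0] = depth;
    uintptr_t here = (uintptr_t)&locals[0];
    if (depth == 0)
        return here > caller_locals ? here - caller_locals
                                    : caller_locals - here;
    size_t d = probe(depth - 1, here);
    locals[1] = (int)d;   /* touch locals after the call so the frame
                             can't be tail-call-optimized away */
    return d;
}

/* Given the measured frame size, how many nested calls fit in the
 * remaining free stack before it overflows? */
static size_t calls_to_overflow(size_t free_bytes) {
    size_t frame = probe(1, 0);
    return free_bytes / frame + 1;
}
```

With a 64-byte frame, `calls_to_overflow(983)` comes out to 16, matching the estimate above; on other compilers and targets the frame size, and hence the call count, will differ.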
Taking the car to the Toyota repair place, they said there was nothing wrong with the car, and nothing strange in the computer's logs.
And they always, always focus on the car manufacturer of the moment. It used to be Audis, and it's confirmed that a whole bunch of Jeeps actually will do this at car washes due to static discharge (the problem's never really been fixed).
I would also expect that he would recognize it if the accelerator pedal had been stuck on a floor mat.
I think it's interesting that even in a small forum such as HN, we still have an anecdote. It's also interesting that car owners still claim to experience the acceleration problem after the two (mechanical) recalls.
It reminds me of the Therac 25 incident, where there were only "anecdotes" to be found for _years_, and the company concluded that a mechanical switch had to be the fault. It's very interesting that we now have evidence that the software _can_ be at fault.
Actually, this points to his story being incorrect. If he slammed on the brakes and the car continued to accelerate, then he probably slammed on the gas by accident instead. Testing shows that if you slam on the brakes while the engine is pegged at full throttle, you'll still come to a screeching halt. The brake system vastly overpowers the engine, even in high-powered cars. In a Camry V6 it actually takes just 16 feet further to stop from 70 mph with the throttle wide open: 190 ft vs. 174 ft.
And the brake system is not drive-by-wire like the throttle is, so it's not susceptible to faulty ECU problems.
If I had pushed the accelerator instead of the brake, you can sure bet I would have done more than just "bump" the taxi in front of me.
The brake was fully pressed to the floor. The engine downshifted and revved to try to move forward. Dunno what else to tell you. Sure seemed like the computer going crazy to me when it happened.
Btw, there have also been Prius recalls due to faulty ABS software, so while you'll probably always have some control, braking action can certainly be affected by software issues.
Also hitting the wrong pedal is not exactly uncommon. People don't like to admit they just made a mistake, hence claims of "unintended acceleration". But just like the claims against Audi back in the day, the modern claims against Toyota appear to be unsubstantiated.
Some people might be unwilling to admit that they stepped on the wrong pedal, but the likelihood that some random HNer would do that and then make a post about it?
ABS brakes work by relieving _or_ reinforcing brake pressure, so depending on the system you could end up with uneven, little or no braking action after a failure. Or you could end up with locked wheels.
And I don't think HNers are somehow less likely to hit the wrong pedal or refuse to admit it. There's always that possibility that they don't even know they hit the wrong pedal.
...and the 400% increase in unintended acceleration events (from the testimony) starting in 2004 is also a complete accident, or brought on by some witch hunt?
Aren't you stretching the limits of what is a reasonable assumption in order to maintain that opinion?
There's also no specific reason to think the driver would identify various mundane fault conditions - i.e. if your car lurches forward and you hit a taxi, then such an impact would be enough to unwedge a stuck carpet.
There's also lots of unknown detail - i.e. if the car was accelerating out of control while you rode the brake, and you hit something at 5 mph, then did the car stop accelerating at that point? Was the engine damaged so that it stopped then? Why not shift into neutral if you're cognizant of the fault condition? Did the car not shift, or what happened, etc.? In these cases the make and year model - in Thailand - is pretty important too, since it's a market which would have a lot of old used cars. A 1980s model minivan is going to have rather different throttle control.
It has all the hallmarks of urban legend, replete with the buy-in line of it being foreign cars - which is how the story always goes and is played in the media. Sure, I'm willing to believe there are real control system issues, but it seems odd to me that no recall notices go out for ECU updates.
That there have been tens of deaths and millions of car recalls because of unintended acceleration in Toyota cars is at least not an urban legend. Of course it will be very difficult to locate the exact problem, but this testimony is interesting in that it shows how that can be the case (contrary to what Toyota has claimed). [Edit: And they have actually been able to reproduce unintended acceleration via memory corruption.]
The transcript from the Toyota case reveals that a crash of the control thread (managing acceleration, say) would also take down the brake control system until the car was restarted.
The downside is that hitting the brake would not cut the gas.
The upside is that you would need two separate, isolated systems to fault in order to have out-of-control acceleration. Even if the accelerator system is broken, slamming on the brakes will generally always win at the hardware level.
IIRC, I think for the court case, the brakes for the particular model of car physically could not stop the car if the engine was at 100% throttle at highway speed.
My minivan is a Toyota Innova. Dunno the year, a 2011 or 2012. And this happened BEFORE I heard of any software glitches. And right when it happened my first thought was "MY GOD, THE COMPUTER'S GONE CRAZY".
edit: oh and one more thing: I'm a one-foot driver, and I could tell if my foot wasn't on the brake! ;)
I mentioned in another reply: I thought it was the computer's fault, and only months later did I see articles about other Toyotas having the same problem due to a software bug.
You pay a lot for those cars; can't they at least put in better electronic hardware? They probably have less computing power than my phone from 5 years ago.
Car electronics have a very long development process. When the cars in question (models from ~5 years ago) were designed (~10 years ago), the hardware they chose was probably quite decent for that era.
When the next model of the car is designed, they will most likely end up using the same model of computer (or a successor with conservative upgrades) to avoid having to redesign the hardware and software that much.
The cost of the actual hardware is negligible compared to the cost of the redesign.
I would argue that Toyota squandered billions with this failure. The ECU needs to sit behind a protocol that allows it to be swapped out at any time, decoupling its evolution from the rest of the machine.
It is a shame, morally and fiscally, that embedded development isn't using safe, provable, and verifiable languages.
The WRT54 also runs in a comfy corner of your living room, fails every few years, and crashes.
They shouldn't be dealing with recursions. If stack corruption is what caused their failures, inappropriate testing played an important role IMHO.
> You pay a lot for those cars, can't they at least put better electronic hardware.
This isn't how it works for cost-sensitive designs. You don't hear people boasting about how they have a quad-core car computer and how the touchscreens from their motor control are perfect for Facebook interactions.
The way people think about this is: if half your RAM never gets used, then you bought twice as much as you need and your module is more expensive than it should be. CPU use never climbs past 20%? Then 80% of its capacity is wasted money. And so on.
"Better electronic hardware" (in the sense of "more powerful" or "faster") also introduces additional complexity. This means more difficult constraints in testing, longer and more expensive verification processes, additional non-deterministic behaviour and so on.
Not that their system wasn't at fault. It was, but throwing more hardware at it wouldn't have made it better.
I work for a company that makes quad-core computers for automotive use, and they do end up being used for Facebook interactions among other things, like the dashboard etc. The engine management computers will be a separate entity, though. If you look at the big auto shows from the past few years, the car manufacturers clearly do think that this is going to be a major differentiator in the next years, and it's going to reach average consumer models too, not just premium sports cars like today.
But the quad core chips we sell today will be on the road in five or more years. By the time they roll out of the assembly line, the computers will not be spectacular by the standards of that day. A smartphone is 6-18 months from design to production, a car is several times that.
It's not like the car manufacturers were cheap on the hardware.
Instead of using stuff like netbook chipsets they tend to gravitate towards mobile chipsets. The difference between the i.MX6 and AMD's Jaguar (just examples; you can also look at Intel's chipsets and other boards like Tegra, etc.) is like night and day. Why isn't the Jaguar used in mobile phones? Because it can use tens of watts compared to just a few watts for the i.MX6.
So at least to me it seems the companies wish to save a few watts of power usage and a few tens of dollars per car.
Things like engine control modules and even entertainment systems in cars operate outside of the environmental ranges that a netbook is designed to survive in. I don't want my car sidelined because of some unreliable computer.
There are also issues of logistics at stake, such as maintenance. Unless AMD is willing to manufacture a certain Jaguar chip for the 5-8 years that automotive manufacturers typically require it, the Jaguars wouldn't even be considered for many systems.
IMO, you need several sets of limits: standard limits posted to the consumer, engineering limits posted to the techie/maintenance guy/developer/etc., and actual limits... each of those is comfortably beyond the others. Know the actual limit, but design well under it if at all possible, because the system will be misused.
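That layering can be sketched in C with made-up engine-speed numbers (purely illustrative; real limits would come from the mechanical design):

```c
/* Three layers of limits for an engine speed, each sitting
 * comfortably inside the next. Numbers are invented for
 * illustration only. */
#define RPM_LIMIT_POSTED      6000   /* redline shown to the consumer    */
#define RPM_LIMIT_ENGINEERING 6500   /* rev limiter enforced in software */
#define RPM_LIMIT_ACTUAL      7200   /* physical limit; never design here */

/* The controller enforces the engineering limit, keeping margin
 * below the actual limit even when the posted one is exceeded. */
static int clamp_rpm(int requested) {
    return requested > RPM_LIMIT_ENGINEERING ? RPM_LIMIT_ENGINEERING
                                             : requested;
}
```

The point of the three constants is exactly the one above: the system is designed against the middle number, so misuse up to the posted limit, and somewhat beyond it, never approaches the actual one.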
Of course you're right if we're talking engine management/internal stuff, but complaining about the laggy/slow/annoying performance in all things entertainment is quite a well-known first world problem in my circles.
In fact, I'm annoyed by each car I drove over the last 10 years due to their inability to provide sensible hardware (and charge a huge markup for all these 'official' components on top: Think navigation: You get a 3rd party system for a fraction of the cost of the supplier provided one, often providing better features, decent updates, extensibility - while you're stuck with whatever your manufacturer grabbed for pennies).
So in context ("Increase RAM so that the stack doesn't grow into the area where my acceleration value is stored") you're right, this isn't an issue. In general though I haven't seen a car manufacturer that gets consumer electronics/entertainment etc. right.
What I find confusing about the article is that it describes how to avoid problems on a completely different architecture: ARM (von Neumann) vs. V850 (Harvard), hard stack exception vs. none, etc.
These differences and resulting inapplicable recommendations confound what could be an interesting article.
Quite a challenge when the consequences for failure could be extreme.
The crucial aspect in the failure scenario described by Michael is that the stack overflow did not cause an immediate system failure. In fact, an immediate system failure followed by a reset would have saved lives, because Michael explains that even at 60 mph, a complete CPU reset would have occurred within just 11 feet of the vehicle's travel.
We have seen this scenario play out a million times. Some system designers believe it is acceptable to keep the system running after (unexpected) errors occur. "Brush it under the rug, keep going, and hope for the best." Never ever do that. Fail fast, fail early. If something unexpected happens, the system must immediately stop.
At the point where you are trading off one failure state for another, you need to think very carefully about which one is truly worse, and sometimes "Fail fast, fail early" is a much much worse choice than the alternative.
Sure, for testing, you should absolutely fail early and loudly, because there are no consequences for doing so. But in the real world, and especially in control systems, "failing" can cause more damage than persisting in a corrupted state.
> small unexpected errors are allowed to accumulate.
I'm not arguing that you should just ignore errors completely, just that the correct response to a particular error must be very carefully considered, and your blanket suggestion to reset immediately is often a terrible idea.
I think a far more appropriate response to a single byte being written outside of the expected area, is to shut down all unnecessary processes, and tell the driver to pull over. This would be far safer than just randomly disabling the controls, however temporarily.
Yes, there are always trade-offs to be made. But a corrupted system is utterly unpredictable - if a control port can do something, it eventually will, at the speed of electronics. On the other hand, a shutdown is entirely predictable. We design for it, because every system does eventually fail and shut down. It's a 'normal' mode of operation in this sort of system design. That tends to weight the design heavily towards very fast resets and redundancy in mission-critical (people die if you mess up) systems. I can design a system to handle a shut-down control board (limit the robot arm's travel, etc). I often can't do anything if the arms are waving about randomly, under power.
I'm not opining randomly; I've worked in flight control, robotics, UAVs, and factory machinery. Until my current job, I've always had to worry about killing somebody. You really can't let software that controls dangerous equipment continue to run in a damaged state unless the rest of the system has a supervisor mode that can override and limit the system's behavior. Even then, I'm having trouble thinking of a scenario where I would prefer to leave that process running vs shutting it down.
Taken all together, the impression I get is of a problem in driving skill that's roughly as difficult as a blown tire at highway speeds -- possibly a bit easier, actually, considering the deleterious effect a blown tire has on steering control. An alert and competent driver should be able to handle either situation without posing a deadly danger to herself or anyone else.
The same, I think, cannot reasonably be said of uncommanded acceleration. With a blown tire or a failed ECU, all you have to do is use your ordinary driving controls to get off the freeway and bring the vehicle to a safe stop. With a throttle stuck wide open, you are suddenly a race car driver, only you're neither in a race car, nor in a race. You can't bring the vehicle to a stop with the ordinary driving controls, at least two of which -- the accelerator and the brake -- are no longer responding properly or at all, which is itself frightening and disorienting to the driver. In order to bring the vehicle to a stop, the driver must shut off the engine with the key, at which point the problem reduces to our failed-ECU worst case above, just with some more speed to burn off.
But there is no circumstance, in either the normal driving regime or even any other abnormal one, where turning off a moving car is a proper or safe response to any situation, which is why almost no driver has ever given the slightest thought to doing so -- and when your car's speeding up past ninety all of a sudden, and you're not telling it to, is maybe not the best time to be thinking up new ways to interact with your car that you've never thought of before. At a guess, I'd say some people whose Toyotas ran away on them were able to come up with the idea in time, and they mostly survived. Others ran out of time before they thought of it, and those unlucky souls mostly died.
Unwanted acceleration on ice would be worse though. I can't imagine, in rush hour conditions, being unable to control my acceleration. At best, you'd rear-end someone, and their car would stop you.
The point is that any erroneous result would cause the offending computer to be voted out immediately. And that's what you want, otherwise the system will appear to work fine, while being in an undefined state.
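A minimal sketch of that voting scheme in C, assuming three redundant computers that each produce an integer result (2-of-3 majority; the dissenter is reported so it can be voted out):

```c
#include <stdint.h>

/* Returns 0 and the majority value in *out; *faulty gets the index
 * (0-2) of a disagreeing unit, or -1 if all three agree. Returns -1
 * when there is no majority at all, in which case the whole channel
 * must fail safe. */
static int vote3(int32_t a, int32_t b, int32_t c,
                 int32_t *out, int *faulty) {
    if (a == b) { *out = a; *faulty = (c == a) ? -1 : 2; return 0; }
    if (a == c) { *out = a; *faulty = 1; return 0; }
    if (b == c) { *out = b; *faulty = 0; return 0; }
    return -1;
}
```

Real redundant systems vote on many more signals and define a safe state for the no-majority case; this only shows the core comparison that makes an erroneous unit immediately visible instead of silently tolerated.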
I think it was one of the Mars landers where they disabled error detection for landing. If the system reset near the ground then the lander would crash before the controller had rebooted, whereas a small error might not have disrupted the landing at all.
Fail fast is good advice, but should be weighed against other strategies to choose the most appropriate one for the circumstance.
Then it fails again. Resets again. Fails again...
It may be preferable to restart, but in many failure scenarios, if the source of the failure is unknown, it is not a given that you'll be able to restart cleanly again.
I agree that failing fast to an extent is preferable, but that too requires careful testing and consideration.
The missing bit is generally that if it fails, it needs to set a note that it should not resume normal operations.
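That "note" can be as simple as a fault code persisted before the reset. A sketch, with a static variable standing in for an EEPROM cell and a hypothetical, platform-specific reset hook:

```c
#include <stdint.h>

static uint8_t nv_fault;   /* stand-in for a non-volatile fault cell */

enum mode { MODE_NORMAL, MODE_LIMP };

/* On an unexpected error: record why we died, then force a reset.
 * (The reset hook is hypothetical; on a real MCU this would be a
 * watchdog trigger or similar.) */
static void fail_fast(uint8_t code) {
    nv_fault = code;
    /* trigger_watchdog_reset(); */
}

/* At boot: if the previous run died of a fault, come up in a
 * degraded limp-home mode instead of resuming normal operation. */
static enum mode boot_mode(void) {
    return nv_fault != 0 ? MODE_LIMP : MODE_NORMAL;
}
```

This breaks the reset loop described above: the second boot sees the persisted fault and refuses to resume normal operations until the cause is cleared.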
UPDATE: Yup, #70 on the MISRA C rules: http://home.sogang.ac.kr/sites/gsinfotech/study/study021/Lis...
I am hoping there are experts here who can shed some light on this.
As for the last part of the article, there are ways to get around this issue when using operating systems, too. Some of them depend on certain hardware features being present, of course, and what should be inferred from this is that, where such protection features are critical, hardware should be picked accordingly.
Edit: I've seen a lot of posts here blaming C and the unmanaged runtimes. While a managed runtime can provide some amount of protection, it's worth noting that:
* It requires an MMU. These aren't smartphones, folks. An MCU with an MMU isn't cheap.
* In the absence of correct memory management from the kernel (and that also requires an MMU!), it's perfectly possible to smash data regions with overgrown stacks, even on managed runtimes.
* You can achieve a good amount of protection using special compiler features. That requires no special hardware support.
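One such measure (GCC's `-fstack-protector` automates a related idea for function frames) is a stack canary: a known pattern at the far end of the stack that is checked periodically. A hardware-independent sketch, with a plain array standing in for a task stack:

```c
#include <stdint.h>

#define STACK_WORDS 256
#define CANARY      0xDEADBEEFu

/* The task's stack, assumed to grow downward toward index 0. */
static uint32_t task_stack[STACK_WORDS];

static void stack_guard_init(void) {
    task_stack[0] = CANARY;   /* pattern at the last legal word */
}

/* Called periodically (e.g. from a supervisor task): a changed
 * canary means the stack has grown past its limit. */
static int stack_overflowed(void) {
    return task_stack[0] != CANARY;
}
```

A common refinement is to fill the whole unused stack with the pattern at startup and scan for the deepest overwritten word, which gives a high-water mark rather than just an overflow flag.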
(Somewhat interrelated, eg cowboy attitude -> C, path dependency in C usage -> hard to reason about programs -> hard to quantify risks)
Other problems are with management- they see embedded code as a "write it and forget about it" schedule item, not a continuous improvement one. Much of this is due to the fact that embedded code is not as portable as workstation code.
All of these shortcomings sufficed when the number and time pressure of embedded projects were much smaller, 5-10 years ago. Now that embedded has exploded in ubiquity, its requirements are increasing and it's getting less time to be perfected in most markets.
So yeah, I guess it's safe to say the embedded world is in a bit of crisis at the moment.
My experience is that it hasn't penetrated very far there, either. Sure, we had version control and code reviews. The process was so warped and bastardized from industry standards, though, that it became a CYA/box-checking burden to keep the SQAs (who really didn't understand code) happy, rather than a tool to improve development. Plus, most of the managing and systems engineers were hardware types who didn't really understand what the software groups were doing or how they operated.
Important note: I'm a hardware type, too. I'm just a hardware type who actually got and wrote software. That made the situation doubly frustrating.
Everyone is a C expert, except when they happen to be a tiny cog in a big coding machine.
"I know where all my pointers and memory allocations are!" Sure, but does everyone on your 20+ person team, from the intern up to the senior guys, know?
Then things like this are bound to happen.
Try that in Python, Ruby (or ML for that matter).
Toyota should not have been using recursion in the first place, and it seems they were too cheap to invest in static analysis tools like Coverity.
If you statically allocate an array, the compiler will ensure that you get the amount of space that you asked for, or raise a compile-time error. If you dynamically allocate an array (which you probably shouldn't be doing in this case anyway) then you'll either get a pointer to an array, or NULL. Either way, you'll know when it's safe to use the array. With a little bit of discipline, it's not difficult to avoid buffer overflows.
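The contrast can be made concrete. In this sketch (names illustrative) the static buffer's size is a compile-time fact, the dynamic allocation is checked, and writes carry an explicit bound so an overflow becomes a detectable error:

```c
#include <stdlib.h>
#include <stddef.h>

#define N 64

static int fixed_buf[N];   /* all N slots guaranteed to exist at link time */

/* Dynamic allocation must be checked: returns a zeroed buffer of
 * n ints, or NULL, which the caller must handle before any use. */
static int *make_buffer(size_t n) {
    return calloc(n, sizeof(int));
}

/* A bounds-checked store: carrying the length alongside the buffer
 * turns a silent overflow into an explicit failure. */
static int store(int *buf, size_t len, size_t i, int v) {
    if (i >= len)
        return -1;   /* out of bounds: refuse instead of corrupting */
    buf[i] = v;
    return 0;
}
```

The discipline mentioned above is mostly this: never let a raw pointer travel without its length.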
Recursive functions don't have a guarantee of safely running. Yes, there are ways to show that certain kinds of recursion will always terminate, and it might even work when you're calling the function at the top frame, but what happens if it's called further down the stack? What happens if the data structure guiding the recursion changes and now it takes a deeper stack than before?
The real problem is that we need a higher-level systems programming language.
> Recursive functions don't have a guarantee of safely running.
Neither do loops for that matter. A loop doesn't have any guarantee that it will ever terminate. Most stack overflows that happen are due to recursions with bad or missing exit conditions, but you can have the same problem with plain loops too.
> With a little bit of discipline, it's not difficult to avoid buffer overflows.
Buffer overflows are among the biggest, most expensive problems in this industry and the primary reason for all the vulnerabilities you're seeing in the wild.
I was all excited to defend StackOverflow.com.
The solution was to pop open the bonnet and swap in a replacement cable, which probably cost a couple of quid.
This recollection combined with the Toyota story merely convinces me that automobile automation has got completely out of control.
This kind of user corrective action is not possible on modern cars which I consider a huge engineering flaw.
I ask because that's how I practice responding to uncommanded acceleration, which I've done on occasion since I first heard of the failure mode. I've done this a few times a year in each of several cars, and as far as I can tell it's had no ill effect, but if "as far as I can tell" isn't far enough then I'd like to know it.
Many ECUs store data (user settings, logging, and fault codes) to non-volatile memory before shutting down, so the key only sends a signal; it's not a hard switch. But it could still help if all the other ECUs except the faulty one do shut down correctly.
* switch to neutral
* safely exit the road and stop
* turn off engine
I am glad you have practiced it.
My concern is over the code running tiptronic transmissions, they are computer controlled manual transmissions where you no longer have an effective physical connection.
Mechanical pieces fail over time, but we know a whole lot more about inspecting for, finding and preventing mechanical faults than we know how to create defect free software.
Marginally relevant side point: I remember reading that one of the advantages enjoyed by the Americans over the Japanese in WWII was the former's significantly greater expertise in mechanical maintenance of front-line equipment, gleaned from their culture of fixing up and maintaining automobiles at home.
"A program is, as a mechanism, totally different from all the familiar analogue devices we grew up with. Like all digitally encoded information, it has, unavoidably, the uncomfortable property that the smallest possible perturbations -- i.e., changes of a single bit -- can have the most drastic consequences."
Which does lead into: why are we trying to learn about stack overflows and critical system issues from imaginary subjects?
And even if it's not a myth, as above there's no proof that a stack overflow was the supposed issue.
What we are accustomed to discussing on HN, for example, does not exist in these worlds. Continuous integration? Unit tests? Even complexity analysis.
And very, very old code that's patched over and over and shipped "when it works".
It's usually people who have had only academic contact with programming languages and embedded development and don't know anything about code quality. But you can bet their bosses incentivize CMMI and other BS like that. (Yes, complete and utter BS.)
Not to mention ClearCase, which seems to be a constant: the worse the company, the more they love this completely useless piece of crap.
I've worked in that world for a long time, and I assure you we did continuous integration, unit tests, and complexity analysis. Way back in the early 90's, long before it made it into the general population, so to speak.
I agree that there are terrible groups out there, but in general there is a far greater emphasis on safety, quality, and correctness than in the non-mission critical world.
The car companies know how to do this. Maybe they messed up in this case (I'm skeptical of the article), but it's not because they don't know software.
The transcript is very enlightening. It was extremely clear that on this particular project, the software development process was a total trainwreck. No one who was familiar with the SW dev literature had technical leadership and authority over the codebase. As a matter of fact, the transcript is so shocking it could be used as a manual of antipatterns for SW development both in embedded and out of embedded. A friend and I (we used to both work at an embedded systems company) spent an evening going over the transcript and mocking the errors. :-) By and large, the errors were of the design form. E.g., too much work on the critical threads. Not separating brake and acceleration threads. Four thousand globals. I think the cyclomatic complexity was something like > 1000 for the control path function. Etc.
One of the remarks is actually that Toyota had taken some lessons learned from the time the codebase was developed and had been working on improving since then. So that's good.
Transforming recursive algorithms to tail calls only isn't always an option, or it makes the algorithm unnecessarily complex. The problem here is that in any critical control unit, recursion must be verified safe and/or there must be an external measure to detect and recover from stack overflows.
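One standard transformation (a sketch, not Toyota's code): replace the call stack with an explicit, statically sized stack whose depth bound is enforced, so exhaustion becomes a reported error instead of silent memory corruption. Here for an in-order tree walk:

```c
#include <stddef.h>

struct node { int value; struct node *left, *right; };

#define MAX_DEPTH 32   /* from a worst-case analysis of the data */

/* In-order traversal without recursion. Returns the number of nodes
 * visited, or -1 if the tree is deeper than MAX_DEPTH (fail loudly
 * instead of overflowing the call stack). */
static int walk_inorder(const struct node *root, int *out, int out_len) {
    const struct node *stack[MAX_DEPTH];
    int sp = 0, n = 0;
    const struct node *cur = root;
    while (cur != NULL || sp > 0) {
        while (cur != NULL) {
            if (sp == MAX_DEPTH)
                return -1;            /* depth bound hit */
            stack[sp++] = cur;
            cur = cur->left;
        }
        cur = stack[--sp];
        if (n < out_len)
            out[n] = cur->value;
        n++;
        cur = cur->right;
    }
    return n;
}
```

The stack cost is now a fixed, analyzable constant (`MAX_DEPTH` pointers) regardless of the input, which is exactly the property a worst-case stack analysis needs.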
In the absence of specific and concrete evidence that your compiler performs this optimization, and that you have tested and will keep testing it (including checking the emitted assembly code), it is correct to assume that TCO does not happen and to perform stack depth analyses based upon that.
Regardless, it's the engineer's responsibility not to make assumptions about such a critical part of the design.
Edit: I'm sorry, I was somewhat wrong. Accessing the stack below SP is considered a bug, but the x86 enter instruction can make it seem like such an access has taken place if it faults:
But... sooner or later, it seems, we are going to go (back) there.
Instructions will become truly privileged, physically-controlled access. Data may go screwy -- or be screwed with -- but this will not directly affect the operating instructions.
Inconvenient? As development becomes more mature, instructions will become more debugged and "proven in the field". Stability and safety will outweigh ease and frequency of updates.
My 30+ year old microwave chugs along just fine. It doesn't have a turntable nor 1000 W, but I know exactly what it will do, how long to run it for various tasks, and how to rotate the food halfway through to provide even heating.
My 34 year old, pilot-light ignited furnace worked like a champ, aside from yet another blower motor going bad. I listened to the service tech when he strongly suggested replacing it before facing a more severe, "winter crisis" problem.
The new, micro-processor based model is better in theory (multi-stage speeds, and longer run times for more even heating). In practice, it's been a misery. The first, from-the-factory blower motor was defective. When that was replaced, the unit started making loud air-flow noises periodically.
Seeing the blower assembly removed, it's constructed of sheet metal. The old furnace, by contrast, had a substantial metal construction that was not going to hum and vibrate if not positioned absolutely perfectly and with brand new, optimized ductwork.
Past a point, reliability starts to -- far -- outweigh some other optimizations.
This is going to become true in our field, as well.
 I guess I was thrown off by the shoot-yourself-in-the-foot scenario, where the stack grows toward fixed data structures. If the heap and stack grow towards each other, you have quite a bit of flexibility (though with some danger of collision). If you have the stack grow towards fixed data structures, its size is fixed and it can cause a dangerous overflow. The only disadvantage of the safe example is less flexibility, but for a critical embedded system, that is fine.
When 180+ IQ brains analyze your work they're bound to find "horrible defects" that no "competent" programmer would ever make.
I just remembered an old article about the software development process at NASA and how it does not rely on the kind of rockstar-programmer-genius culture that is so common in other parts of the industry.
Please support your answer with tracing analysis over a set of one billion Monte Carlo simulations and present an accurate and up-to-date IDEF6 model of the application system.
I'm going to lick someone with a very bad cold now, just in case.
The obvious solution to stack overflows is to make the stack bigger. The obvious problem with this solution is that it just kicks the can down the road.