
Are we shooting ourselves in the foot with stack overflows? - nuriaion
http://embeddedgurus.com/state-space/2014/02/are-we-shooting-ourselves-in-the-foot-with-stack-overflow/
======
kens
If I'm reading the testimony correctly, there is actually no evidence that a
stack overflow caused unintended acceleration. The idea is that Toyota used
94% of the stack, and they had recursive functions. If (big if) the recursive
functions used enough stack to cause an overflow, memory corruption could
happen. If (big if) that memory corruption happened in exactly the right way,
it could corrupt variables controlling acceleration. And then, maybe,
unintended acceleration could occur.

But that's a far cry from the stack overflow actually causing any cases of
unintended acceleration.

~~~
shultays
Isn't 94% of stack usage a bit high? When you are dealing with recursion?

You pay a lot for those cars, can't they at least put in better electronic
hardware? They probably have less computing power than my phone from 5 years
ago.

~~~
weland
> When you are dealing with recursions?

They _shouldn't_ be dealing with recursion. If stack corruption is what
caused their failures, inadequate testing played an important role IMHO.

> You pay a lot for those cars, can't they at least put better electronic
> hardware.

This isn't how it works for cost-sensitive designs. You don't hear people
boasting about how they have a quad-core car computer and how the touchscreens
from their motor control are perfect for Facebook interactions.

The way people think about this is: if half your RAM never gets used, then
you bought twice as much as you need and your module is more expensive than
it should be. CPU usage never rises past 20%? It's about 80% more powerful
than it needs to be. And so on.

"Better electronic hardware" (in the sense of "more powerful" or "faster")
also introduces additional complexity. This means more difficult constraints
in testing, longer and more expensive verification processes, additional non-
deterministic behaviour and so on.

Not that their system wasn't at fault. It was, but throwing more hardware at
it wouldn't have made it better.

~~~
chiph
But having some additional capacity available gives them the ability to do
field upgrades. The firm I worked for a while ago had to undertake a very
expensive hardware refresh because there just wasn't any way to get any
additional bug fixes into the field -- they were down to 20 bytes free. In
something like a car, that you know people are going to drive for 10+ years,
you need that extra space not only to make bug fixes, but also to comply with
new legislation (such as brake override), and also to offer a few new features
to your customers.

~~~
sitkack
Another valuable engineering lesson: Always Have Headroom. If one designs or
operates at the limit, there is no margin for error. Resilient systems can get
pushed beyond their acceptable limits and recover.

~~~
pnathan
+1.

IMO, you need several sets of limits: standard limits posted to the consumer,
engineering limits posted to the techie/maintenance guy/developer/etc, and
actual limits... each comfortably beyond the last. Know the actual limit, but
design well under it if at all possible, because the system _will_ be misused.

------
bjourne
Even if you work in a GC'd language with a VM where all memory errors are
checked, here is the major, MAJOR, wisdom you should take with you:

 _The crucial aspect in the failure scenario described by Michael is that the
stack overflow did not cause an immediate system failure. In fact, an
immediate system failure followed by a reset would have saved lives, because
Michael explains that even at 60 Mph, a complete CPU reset would have occurred
within just 11 feet of vehicle’s travel._

We have seen this scenario play out a million times. Some system designers
believe it is acceptable to keep the system running after (unexpected) errors
occur: "brush it under the rug, keep going, and hope for the best." Never
ever do that. Fail fast, fail early. If something unexpected happens, the
system must stop immediately.

~~~
aninhumer
When you're talking about a control system "Fail fast, fail early" isn't
necessarily an option. In this particular example, while it's arguable that
temporary loss of control might be better than the actual outcome, it's still
a pretty unacceptable thing to happen.

At the point where you are trading off one failure state for another, you need
to think very carefully about which one is truly worse, and sometimes "Fail
fast, fail early" is a _much much_ worse choice than the alternative.

~~~
bjourne
Experience (which should always trump theories about how things ought to work)
proves that you are wrong. What would have happened in this case if fail-fast
had been used and the hardware had reset immediately upon the first byte being
written outside the stack area? Likely the bug would have been caught and
fixed during testing, because debugging a randomly resetting CPU is
comparatively easy. That's a hundred times better than people being _killed_
because small unexpected errors are allowed to accumulate.

~~~
aninhumer
>Likely the bug would have been caught and fixed during testing

Sure, for testing, you should absolutely fail early and loudly, because there
are no consequences for doing so. But in the real world, and especially in
control systems, "failing" can cause more damage than persisting in a
corrupted state.

> small unexpected errors are allowed to accumulate.

I'm not arguing that you should just ignore errors completely, just that the
correct response to a particular error must be very carefully considered, and
your blanket suggestion to reset immediately is often a terrible idea.

I think a far more appropriate response to a single byte being written outside
of the expected area, is to shut down all _unnecessary_ processes, and tell
the driver to pull over. This would be _far_ safer than just randomly
disabling the controls, however temporarily.

~~~
RogerL
Sending uncontrollable accelerations for hundreds or thousands of feet is
safer than a maximum of 11 ft of coasting?

Yes, there are always trade offs to be made. But a corrupted system is utterly
unpredictable - if a control port _can_ do something, it eventually will, at
the speed of electronics. On the other hand, a shutdown is entirely
predictable. We design for it, because every system _does_ eventually fail and
shut down. It's a 'normal' mode of operation in this sort of system design.
That tends to weight the design heavily towards very fast resets and
redundancy in mission-critical (people die if you mess up) systems. I can
design a system to handle a shut-down control board (limit the robot arm's
travel, etc.). I often can't do anything if the arms are waving about
randomly, under power.

I'm not opining randomly; I've worked in flight control, robotics, UAVs, and
factory machinery. Until my current job, I've always had to worry about
killing somebody. You really can't let software that controls dangerous
equipment continue to run in a damaged state unless the rest of the system has
a supervisor mode that can override and limit the system's behavior. Even
then, I'm having trouble thinking of a scenario where I would prefer to leave
that process running vs shutting it down.

~~~
dspeyer
What does it even mean to shut down the control systems on a moving car? All
stop? Coast? What's the physical behaviour that isn't likely to kill someone?

~~~
aaronem
At most, it means you're coasting with no power assist for steering or brakes
-- you'll need some extra braking distance, but that's what the shoulder is
for, and it'll take a bit more than the usual effort to turn the wheel, which
means that getting _onto_ the shoulder, especially if you're in a middle lane
when all this goes wrong, is going to need to be smartly done.

Taken all together, the impression I get is of a problem in driving skill
that's roughly as difficult as a blown tire at highway speeds -- possibly a
bit easier, actually, considering the deleterious effect a blown tire has on
steering control. An alert and competent driver should be able to handle
either situation without posing a deadly danger to herself or anyone else.

The same, I think, cannot reasonably be said of uncommanded acceleration. With
a blown tire or a failed ECU, all you have to do is use your ordinary driving
controls to get off the freeway and bring the vehicle to a safe stop. With a
throttle stuck wide open, you are suddenly a race car driver, only you're
neither in a race car, nor in a race. You _can't_ bring the vehicle to a stop
with the ordinary driving controls, at least two of which -- the accelerator
and the brake -- are no longer responding properly or at all, which is itself
frightening and disorienting to the driver. In order to bring the vehicle to a
stop, the driver must shut off the engine with the key, at which point the
problem reduces to our failed-ECU worst case above, just with some more speed
to burn off.

But there is no circumstance, in either the normal driving regime or any
other _ab_normal one, where turning off a moving car is a proper or safe
response to any situation, which is why almost no driver has ever given the
slightest thought to doing so -- and when your car's speeding up past ninety
all of a sudden, and you're not telling it to, is maybe not the best time to
be thinking up new ways to interact with your car that you've never thought of
before. At a guess, I'd say some people whose Toyotas ran away on them were
able to come up with the idea in time, and they mostly survived. Others ran
out of time before they thought of it, and those unlucky souls mostly died.

~~~
scj
I agree with most of what you've said, but I'll point out that you didn't go
with the worst-case example: loss of steering/brakes on ice.

Unwanted acceleration on ice would be worse, though. I can't imagine being
unable to control my acceleration in rush-hour conditions. At best, you'd
rear-end someone, and their car would stop you.

------
wirrbel
First I was annoyed at yet another upvote-fishing blog post about Stack
Overflow. Then I read it, still annoyed at getting caught by a catchy headline
I consciously despise. Then I saw that it was not at all about some forum on
the web, and now I cannot stop smiling.

------
erichocean
Isn't recursion—even if it's indirect—disallowed completely when doing
embedded C programming for safety-critical devices?

UPDATE: Yup, #70 on the MISRA C rules:
[http://home.sogang.ac.kr/sites/gsinfotech/study/study021/Lis...](http://home.sogang.ac.kr/sites/gsinfotech/study/study021/Lists/b7/Attachments/91/Chap%207.%20MISRA-C%20rules.pdf)

------
xerophtye
So what's the catch? We have been developing memory architectures, embedded
systems, and OSes for decades now. So if the solution is as simple as this
post says, why hasn't it ever been implemented before?

I am hoping there are experts here that can shed some light on this

~~~
zurn
A mix of technological and cultural path dependence, cowboy attitude to
performance/optimization, and inability to quantify risks.

(Somewhat interrelated, eg cowboy attitude -> C, path dependency in C usage ->
hard to reason about programs -> hard to quantify risks)

~~~
diydsp
This. I've done consulting in embedded for years and years. The discipline of
modern desktop/web development has not reached the embedded world, except in a
few high-margin areas (such as military and transportation). There are 1,001
reasons for this: much of the coding is done by people who never learned
source control, open source, or unit testing (i.e., people from disciplines
other than software engineering, and people who don't keep their skill set
current); cross-compiled code is much harder and slower to debug than
workstation code; and resources are more limited in embedded, so code is more
terse, with fewer convenient debugging functions.

Other problems are with management- they see embedded code as a "write it and
forget about it" schedule item, not a continuous improvement one. Much of this
is due to the fact that embedded code is not as portable as workstation code.

All of these shortcomings sufficed when the number of embedded projects, and
the time pressure on them, was much smaller 5-10 years ago. Now that embedded
has exploded in ubiquity, its requirements are increasing and it's getting
less time to be perfected in most markets.

So yeah, I guess it's safe to say the embedded world is in a bit of crisis at
the moment.

~~~
seren
Having worked in a similar environment, I know at least one company where most
managers have a hardware/electronics background, because their product was
mostly HW when they started working 20 years ago. But now the split is more
like 80% SW / 20% HW, and they are managing SW developers without having much
idea of what SW development consists of. I know of at least one example of a
top engineering manager not understanding what version control was: "Please
merge all the features but not the bugs from branch X". (Which is a noble goal
in itself, but might be hard to attain!)

------
rcfox
Talking about how to catch stack overflows and protect your data against them
isn't useless, but it misses the point. There are rules/guidelines, like
MISRA[0] (which the testimony mentions 54 times!) for the automotive industry
that prohibit recursion, and tools that will check for conformance.

Toyota should not have been using recursion in the first place, and it seems
they were too cheap to invest in static analysis tools like Coverity.

[0]
[http://en.wikipedia.org/wiki/MISRA_C](http://en.wikipedia.org/wiki/MISRA_C)

~~~
bad_user
To me banning recursion in order to prevent stack overflows is like banning
arrays to prevent buffer overflows. It misses the problem, doesn't it?

~~~
rcfox
Sort of, but not really...

If you statically allocate an array, the compiler will ensure that you get the
amount of space that you asked for, or raise a compile-time error. If you
dynamically allocate an array (which you probably shouldn't be doing in this
case anyway) then you'll either get a pointer to an array, or NULL. Either
way, you'll know when it's safe to use the array. With a little bit of
discipline, it's not difficult to avoid buffer overflows.

Recursive functions don't have a guarantee of safely running. Yes, there are
ways to show that certain kinds of recursion will always terminate, and it
might even work when you're calling the function at the top frame, but what
happens if it's called further down the stack? What happens if the data
structure guiding the recursion changes and now it takes a deeper stack than
before?

~~~
bad_user
Most recursions can be written as tail recursions. If the compiler can
optimize tail calls (in which case they behave like loops) and warn when a
recursion is not a tail call, then the point is moot. With algebraic data
types and pattern matching, often used for indicating the stage of a
recursion (including the exit condition), the compiler can even warn you that
you missed a branch. In fact I find it easier to express complex loops as
tail recursions, because in time and with practice it gets easier to reason
about all the possible branches and invariants.

The real problem is that we need a higher-level systems programming language.

> _Recursive functions don't have a guarantee of safely running._

Neither do loops, for that matter. A loop doesn't have any guarantee that it
will ever terminate. Most stack overflows happen because of recursions with
bad or missing exit conditions, but you can have the same problem with plain
loops too.

> _With a little bit of discipline, it's not difficult to avoid buffer
> overflows._

Buffer overflows are among the biggest, most expensive problems in this
industry and the primary reason for the vulnerabilities you're seeing in the
wild.

------
gkoberger
Completely unrelated to yesterday's "I No Longer Need StackOverflow"
[https://news.ycombinator.com/item?id=7251169](https://news.ycombinator.com/item?id=7251169)

I was all excited to defend StackOverflow.com.

------
jtokoph
My first thought was: How could stackoverflow.com be responsible for car
crashes?

~~~
Theodores
I suggest we _sing one song to the tune of another_ and re-purpose this
thread to be about stackoverflow.com, as if nobody read the article!

~~~
mikeash
I give it at least a 30% chance that this will happen automatically with
nobody doing it on purpose.

------
tragomaskhalos
I had an "unintended acceleration" case in my old Austin Morris 1300: the
cable connecting the pedal to the throttle snapped, jamming it at a fixed
(fairly high revs) level and requiring me to control the speed using the
brake.

The solution was to pop open the bonnet and swap in a replacement cable, which
probably cost a couple of quid.

This recollection combined with the Toyota story merely convinces me that
automobile automation has got completely out of control.

~~~
userbinator
I think this issue has more to do with complexity than anything else; my car
has an ECU based on an 8-bit CPU from the '70s, and I've never had any
problems with it.

~~~
sitkack
So does mine, and it is fuel injected. A cable controls a plate that controls
the amount of air that gets into the engine. The only way for that design to
fail WFO is for the return spring that keeps the plate closed to snap. At that
point one would press in the clutch and turn off the ignition.

This kind of user corrective action is not possible on modern cars which I
consider a huge engineering flaw.

~~~
aaronem
I'm not tremendously knowledgeable about automobiles, but in a car with an
automatic transaxle, shouldn't dropping it in neutral and switching off the
ignition do essentially likewise?

I ask because that's how I practice responding to uncommanded acceleration,
which I've done on occasion since I first heard of the failure mode. I've done
this a few times a year in each of several cars, and as far as I can tell it's
had no ill effect, but if "as far as I can tell" isn't far enough then I'd
like to know it.

~~~
sitkack
I know of no specific danger in shifting an AT into neutral. You need to be
_EXTREMELY_ careful when switching off the ignition that you don't lock the
steering column. It would be better to

* switch to neutral

* safely exit the road and stop

* turn off engine

I am glad you have practiced it.

My concern is over the code running tiptronic transmissions, they are computer
controlled manual transmissions where you no longer have an effective physical
connection.

~~~
aaronem
My experience has been that switching the ignition from "run" to "accessory"
doesn't risk locking the column, but now that I think about it, dropping the
transmission into neutral should suffice -- the only concern I'd have there,
with a runaway throttle, would be that it'd wreck the engine to have it
running flat out with no load.

~~~
sitkack
The engine will only run at max for 15-30 seconds until you pull off; it will
be fine. Better that than a potentially fatal accident. It's only hardware,
after all.

------
pwg
Example 7 on page 18 of "UNIQUE ETHICAL PROBLEMS IN INFORMATION TECHNOLOGY" by
Walter Maner seems quite appropriate here:

Quote:

"A program is, as a mechanism, totally different from all the familiar
analogue devices we grew up with. Like all digitally encoded information, it
has, unavoidably, the uncomfortable property that the smallest possible
perturbations -- i.e., changes of a single bit -- can have the most drastic
consequences."

[http://faculty.usfsp.edu/gkearns/articles_fraud/computer_eth...](http://faculty.usfsp.edu/gkearns/articles_fraud/computer_ethics.pdf)

------
cognivore
Stack overflows hate the elderly:

[http://www.forbes.com/2010/03/26/toyota-acceleration-elderly...](http://www.forbes.com/2010/03/26/toyota-acceleration-elderly-opinions-contributors-michael-fumento.html) (forbes.com)

~~~
aaron695
Yes, I thought the whole issue was a myth.

Which does lead into: why are we trying to draw lessons about stack overflows
and critical-system issues from an incident that may be imaginary?

And even if it's not a myth, as above, there's no proof that a stack overflow
was the actual cause.

------
raverbashing
The less specialized in SW a company is, the worse its software is.

What we are accustomed to discussing on HN, for example, _does not exist_ in
these worlds. Continuous integration? Unit tests? Even complexity analysis.

And very, very old code that's patched over and over and shipped "when it
works".

It's usually people who have had only academic contact with programming
languages and embedded development and _don't know anything_ about code
quality. But you can bet their bosses incentivize CMMI and other BS like
that. (Yes, complete and utter BS.)

Not to mention ClearCase, which seems to be a constant: the worse the
company, the more they love this completely useless piece of crap.

~~~
RogerL
_does not exist_

I've worked in that world for a long time, and I assure you we did continuous
integration, unit tests, and complexity analysis. Way back in the early 90's,
long before those made it into the general population, so to speak.

I agree that there are terrible groups out there, but in general there is a
far greater emphasis on safety, quality, and correctness than in the non-
mission critical world.

~~~
danielweber
Yeah, the software methodology at car companies makes a lot of the seat-of-
the-pants just-ship-it stuff that HNers are used to look like kindergarten.

The car companies know how to do this. Maybe they messed up in this case (I'm
skeptical of the article), but it's not because they don't know software.

~~~
pnathan
Yes and no.

The transcript is very enlightening. It was _extremely_ clear that on this
particular project, the software development process was a total trainwreck.
_No one_ who was familiar with the SW dev literature had technical leadership
and authority over the codebase. As a matter of fact, the transcript is so
shocking it could be used as a manual of antipatterns for SW development both
in embedded and out of embedded. A friend and I (we used to both work at an
embedded systems company) spent an evening going over the transcript and
mocking the errors. :-) By and large, the errors were of the design form.
E.g., too much work on the critical threads. Not separating brake and
acceleration threads. Four _thousand_ globals. I think the cyclomatic
complexity was something like > 1000 for the control path function. Etc.

One of the remarks is actually that Toyota had taken some lessons learned from
the time the codebase was developed and had been working on improving since
then. So that's good.

------
noelwelsh
It's sad that recursion is considered dangerous. Tail calls have been known
about for a very long time, and the duality between stack and heap for just
about as long.

~~~
reeses
Tail calls are only helpful if optimized into jumps instead of function calls.

~~~
pnathan
And this only happens if your compiler recognizes it. Not all compilers are
smart enough, and embedded compilers often don't get the love that mainstream
compilers do.

In the absence of specific and concrete evidence that your compiler performs
this optimization, and that you have tested for it (including checking the
emitted assembly), it is correct to assume that TCO does not happen and to
perform stack-depth analyses based on that.

~~~
pessimizer
If you also built the computer, wouldn't you already have that specific and
concrete evidence?

~~~
pnathan
Beg pardon? I don't know what you mean. Most embedded companies purchase a
chip; sometimes the compilers come from the same company that made the chip,
sometimes not.

Regardless, it's the engineer's responsibility not to make assumptions about
such a critical part of the design.

------
robryk
Would it be considerably expensive to check at runtime that SP is in an
expected range every time it gets moved? This would work with multiple
stacks, too.

~~~
simias
Well, assuming you run in virtual mode, you could always leave the
page/segment directly below the stack unmapped; that way, if the stack
overflows, it'll trigger an exception by accessing invalid addresses.

~~~
prutschman
There is an enormous class of embedded devices not running on chips capable of
address virtualization.

~~~
ctz
You don't need virtual memory, just memory protection. Many microprocessors
and microcontrollers offer just an MPU, without the full fat of address
translation machinery needed for an MMU.

~~~
sitkack
[http://www.freertos.org/FreeRTOS-MPU-memory-protection-unit.html](http://www.freertos.org/FreeRTOS-MPU-memory-protection-unit.html)

------
pjmlp
Another example of C's impact on our daily life.

------
pasbesoin
I haven't waded into all this, and it's been years -- and years -- since my
education touched upon systems that physically separate operating instructions
from data memory.

But... sooner or later, it seems, we are going to go (back) there.

Instructions will become truly privileged, physically-controlled access. Data
may go screwy -- or be screwed with -- but this will not directly affect the
operating instructions.

Inconvenient? As development becomes more mature, instructions will become
more debugged and "proven in the field". Stability and safety will outweigh
ease and frequency of updates.

My 30+ year old microwave chugs along just fine. It doesn't have a turntable
nor 1000 W, but I know exactly what it will do, how long to run it for various
tasks, and how to rotate the food halfway through to provide even heating.

My 34 year old, pilot-light ignited furnace worked like a champ, aside from
yet another blower motor going bad. I listened to the service tech when he
strongly suggested replacing it before facing a more severe, "winter crisis"
problem.

The new, micro-processor based model is better in theory (multi-stage speeds,
and longer run times for more even heating). In practice, it's been a misery.
The first, from-the-factory blower motor was defective. When that was
replaced, the unit started making loud air-flow noises periodically.

Seeing the blower assembly removed, it's constructed of _sheet metal_. The old
furnace, by contrast, had substantial metal construction that was not going
to hum and vibrate if not positioned absolutely perfectly with brand-new,
optimized ductwork.

Past a point, _reliability_ starts to -- _far_ -- outweigh some other
optimizations.

This is going to become true in our field, as well.

------
Gracana
Are there any downsides to having the memory set up the "safe" way that they
describe? It seems like a win-win situation.

[edit] I guess I was thrown off by the shoot-yourself-in-the-foot scenario,
where the stack grows toward fixed data structures. If the heap and stack grow
towards each other, you have quite a bit of flexibility (though with some
danger of collision). If you have the stack grow towards fixed data
structures, its size is fixed _and_ it can cause a dangerous overflow. The
only disadvantage of the safe example is less flexibility, but for a critical
embedded system, that is fine.

------
jmnicolas
Although I use managed languages, I wouldn't want my code audited by NASA.

When 180+ IQ brains analyze your work they're bound to find "horrible defects"
that no "competent" programmer would ever make.

~~~
tinco
I just refactored some code yesterday and I feel pretty confident; I'd love
to know what they think :P Though my app is just some silly Ruby code, not
real-time, life-critical embedded C software.

~~~
reeses
Awesome, let's walk through every library, system, and machine call exercised
by your ruby code. What happens if you gc while hitting a chunk of ram that
was just hit with a corruption causing cosmic ray while your OS paged out your
app because your intrusion detection/prevention service went defcon two on a
leap millisecond that was announced since your last OS patch and your locale
was set to C?

Please support your answer with tracing analysis over a set of one billion
Monte Carlo simulations and present an accurate and up-to-date IDEF6 model of
the application system.

~~~
mkr-hn
Parking an SUV on Mars is a lot more complicated than parking one at the
grocery store.

~~~
reeses
Plus, if the one on Mars kills anyone, we _all_ know how that movie goes. One
does not simply walk away from killing an extraterrestrial.

I'm going to lick someone with a very bad cold now, just in case.

------
laichzeit0
I'm always skeptical about non-trivial recursive calls, and I generally pass
a "depth" variable in as the first param, increasing it each time I make
another call, with some sane cut-off point where it just returns.

------
ragecore
I think cars should just come with slots where we could put in our phones and
bam!, powerful computing that you could carry along.

------
Fasebook
tl;dr: make the stack bigger! (But then is it really a stack overflow? Oh,
and by the way, this won't work in most systems, because virtualized stacks
on top of physical memory make concepts such as memory ordering
meaningless... but never mind that.)

The obvious solution to stack overflows is to make the stack bigger. The
obvious problem with this solution is that it just kicks the can down the
road.

------
greatsuccess
Before you wonder about stack overflows, ask yourself why the occupants never
applied the brakes.

