
Everything Breaks, All the Time. - joshuacc
http://jeff-vogel.blogspot.com/2011/06/everything-breaks-all-time.html
======
jng
This guy is wrong, plain and simple. Unforeseeable bit-copying errors caused
by cosmic rays and similar circumstances, which do exist, probably account for
less than one out of every billion actual bugs/crashes experienced out
there. When a bit is toggled by a cosmic ray in memory, if it hits program
memory, it will with a significant frequency crash your program.

Most bugs you experience daily (Word or your favorite game/app crashing,
etc.) are caused by actual software errors in overly complex systems with
many dependencies. Multithreaded-coding errors account for many of those, but
not all: a complex system with many layers and complex dependencies can hide
obscure behaviors that cause crashes on a given machine if, for example, you
have a weird combination of disk drivers, file system code, and an antivirus
hooking and acting on every filesystem read or write.

When, for example, a Word plug-in makes a call to Word's object model, this
goes through easily 10 software layers until it reaches its target, some of
these layers being configured via the flaky Windows registry, others going
through jumps from VM-based to native code using weird "marshalling"
techniques, etc. In these cases, you may encounter buggy behavior in any one
of the 10 layers, or in a combination of two of them, even if it seems like
you are just incrementing a simple counter.

Most of the time, though, bugs are caused by the app's own code (your own
code): careless code, dangerous practices, lack of solid control-flow design,
etc. If you write really good code, it's unlikely you will have many support
issues, unless you are working in some problem-prone area (plug-ins to other
complex, often poorly designed products, code pushing graphics drivers to the
max, etc.) where you get into "complex system" behavior.

Even if you use multithreading, if you control all the code, you can write
very solid code. If your multithreaded code is perfect, it won't crash,
although it can uncover bugs in third-party libraries, etc., which is why I
tend to write only "worker threads" with no third-party dependency if
multithreading is required.
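
As a rough illustration of that "worker thread with no third-party dependency" idea, here is a minimal Python sketch using only the standard library. The names `worker` and `run_jobs` and the squaring "work" are placeholders I made up for illustration, not anything from the article.

```python
import queue
import threading


def worker(jobs: queue.Queue, results: queue.Queue) -> None:
    """Pull jobs until a None sentinel arrives; touches only its own queues."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work for this thread
            break
        results.put(job * job)  # stand-in for the real computation


def run_jobs(values, n_workers=4):
    """Fan values out to worker threads and return the sorted results."""
    jobs = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for v in values:
        jobs.put(v)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    out = []
    while not results.empty():  # safe here: all producers have finished
        out.append(results.get())
    return sorted(out)
```

The workers share nothing except two thread-safe queues, so any crash that remains is at least confined to code you control.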

And I think it's very dangerous to teach novice programmers to think that the
bug is probably somewhere else.

~~~
gwc
His point makes a lot more sense in the context it was originally intended.
He's not making a point about programming or debugging in general; he's
specifically discussing tech support as a one-man indie game shop. In
particular, it's all about the cost-benefit tradeoff. In his words (taken from
the first post in the series -
[http://jeff-vogel.blogspot.com/2011/06/seven-tips-for-giving...](http://jeff-vogel.blogspot.com/2011/06/seven-tips-for-giving-good-tech-support.html)):

 _But at the same time, as a small developer, you have very little time to
spare for support. Time spent getting the game working for one person is time
not spent making a new game for everyone. You will need to develop a sense of
when the time lost helping a person is not worth it, either because you won't
be able to solve their problem or because they will not be able to implement the
fix you provide._

...

 _Remember: It's only worth the time to do tech support if you have the chance
to, in a reasonable amount of time, fix a problem and make a loyal customer.
If you realize that, at the end of the road, you aren't going to end with a
happy person and a working product, end the conversation as quickly and
pleasantly as possible._

In that context, I think his approach is very rational. If you pushed him,
he'd probably agree that more often than not the issue is in his code (even if
it's just a question of inadequate error handling). However, if the problem is
only seen by a single user and will be a significant investment to try and
fix, then it's simply not worth the time when he could be working on a new
game, a port, or even a different problem that has been seen by multiple
users.

------
tsewlliw
I don't get this. It's so often a bug, and so many people don't report bugs;
this strategy of telling people to reboot or reinstall or redownload just
perpetuates these voodoo-style fixes.

I'm not saying fix everything always immediately, but don't write people off as
victims of cosmic rays just because you can't repro in 30 seconds or don't see
the bug where you'd expect in the code.

~~~
jodrellblank
He didn't say "cosmic rays" anywhere in the article.

 _voodoo-style fixes_

They're not voodoo, they're sledgehammer-to-smash-a-nut fixes. A reboot
reinitializes every part of your system into a mostly-known-good state. If you
knew what, you could say "restart this service" or "reinitialise this driver
like that", but a reboot gets all of it.

If you actually stabbed a doll with a pin and your program started working,
that would be ... scary.

~~~
tsewlliw
His actual criteria for taking the time to find a bug are reasonable, but I
take issue with the assertion that it's not a bug in code he wrote most of the
time.

------
wccrawford
Only checking for a bug reminds me of the Intel bug that they claimed would
hardly ever happen, but turned out to happen a LOT.

I don't ignore bugs. I follow the same first step, and send the standard list
of things to try like reinstalling, rebooting, etc. But if they still have it,
I always look into it. Almost every time it's been a real bug. Some were
really hard to track down, but would have caused a lot of grief later. I was
always glad I did it.

~~~
Quarrelsome
Have you encountered a hardware defect yet? If/when you do, it represents a
lot of technically dead time that was spent looking at code. I'm not saying
either premise is right but I can appreciate his philosophy here.

For the record, I killed four weeks digging through code and running tests and
it turned out that temperatures in winter coupled with some bad soldering was
the cause of the issue. D:

~~~
JoeAltmaier
Yes, you have to track them down; yes, it takes forever for the hard ones. But
most of the time it isn't hardware; most of the time it's my own bug and I can
fix it.

It definitely takes some experience to be good at debugging. I guess that's
why all the emphasis on development environments these days, where the hard
stuff is being debugged by someone else and I can work on my app-level stuff
in peace.

------
k33n
This fact drives me insane. It's theoretically possible to write a perfect
piece of software that can never fall down, break, blow up, etc. But it's
pretty much impossible in practice unless you have near-unlimited resources
(NASA in the '60s), and even with that you still might fail (Microsoft).

------
synnik
There are two completely different conclusions that I would draw from his
facts:

1) Most bugs are in code. But it might not be your code. Your code layers
itself on top of many other layers of code that are outside of your control.
Learning to deal with that will make a difference in your work.

2) Know how everything works. I am always shocked at people who claim to be web
developers who don't even understand how an HTTP request/response works, much
less what your browser does with the results. It is one of my interview
questions for tech folk - I ask them to explain to me exactly what happens on
the server when a browser sends it a request. Few people can give much detail
here. Most can only give a generic explanation of the actions taken, if that.
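
For what it's worth, the core of an answer can be sketched in a few lines: the server reads bytes off the connection, parses the request line and headers, picks a handler, and writes back a status line, headers, and a body. The Python below is a minimal sketch of just the parse-and-respond step. The `/hello` route and the function names are hypothetical, and a real server also deals with sockets, TLS, keep-alive, request bodies, and much more.

```python
def build_response(status: int, reason: str, body: str) -> bytes:
    """Serialize a status line, a few headers, and a text body."""
    payload = body.encode("utf-8")
    head = (
        f"HTTP/1.1 {status} {reason}\r\n"
        "Content-Type: text/plain; charset=utf-8\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return head.encode("ascii") + payload


def handle_request(raw: bytes) -> bytes:
    """Parse raw request bytes and return raw response bytes."""
    # The header section ends at the first blank line.
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    try:
        method, target, _version = lines[0].split(" ")
    except ValueError:
        return build_response(400, "Bad Request", "malformed request line\n")

    # Header names are case-insensitive, so normalize them.
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()

    if method != "GET":
        return build_response(405, "Method Not Allowed", "only GET is handled\n")
    if target == "/hello":  # hypothetical route, for illustration only
        host = headers.get("host", "unknown host")
        return build_response(200, "OK", f"hello from {host}\n")
    return build_response(404, "Not Found", f"no route for {target}\n")
```

Feeding it `b"GET /hello HTTP/1.1\r\nHost: example.test\r\n\r\n"` returns a `200 OK` response whose body names the host; anything else falls through to 400, 404, or 405.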

------
rickdale
My biz partner has a GPS system from Garmin. He lives in the central time
zone, but works in the eastern time zone. Any time we use the GPS it will
always add an hour to our trip when we are in EST.
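
One way a bug like this could arise (a guess for illustration, not a diagnosis of the Garmin firmware) is subtracting wall-clock times from two different zones as if they were in the same one. The sketch below uses fixed UTC offsets and made-up departure and arrival times; daylight saving is ignored for brevity.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets are an assumption to keep the example self-contained.
CENTRAL = timezone(timedelta(hours=-6), "CST")
EASTERN = timezone(timedelta(hours=-5), "EST")


def trip_duration_naive(depart_wall: datetime, arrive_wall: datetime) -> timedelta:
    """Buggy: treats both wall-clock readings as if they shared a zone."""
    return arrive_wall - depart_wall


def trip_duration_aware(depart: datetime, arrive: datetime) -> timedelta:
    """Correct: aware datetimes are compared on the UTC timeline."""
    return arrive - depart


# Hypothetical trip: leave at 08:00 Central, arrive at 11:00 Eastern.
depart = datetime(2011, 1, 10, 8, 0, tzinfo=CENTRAL)
arrive = datetime(2011, 1, 10, 11, 0, tzinfo=EASTERN)

naive = trip_duration_naive(depart.replace(tzinfo=None), arrive.replace(tzinfo=None))
aware = trip_duration_aware(depart, arrive)
print(naive, aware)
```

The naive version reports an extra hour for the same trip, which at least matches the symptom.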

Programmers aren't perfect. Practice makes permanent.

~~~
JoeAltmaier
Ha! And my sister went to Egypt and looked up the GPS distance to home (Iowa):
8000 miles. Off by 50%. Why? The programmer was doing Cartesian distance
instead of great-circle. So yes, if she drilled a tunnel through the Earth's
core, it was only 8000 miles. :)

~~~
T-hawk
No, 8000 miles sounds about right for the great-circle distance on the surface
from Iowa to Egypt.

First, the diameter of the Earth is not quite 8000 miles, but 7926. So if
anything says more than 7926 (plus maybe the height of a mountain or
whatever), it's not calculating a straight line in Cartesian 3D space.

Second, that distance of 7926 miles would be from a point to its antipode.
Iowa is not antipodal to Egypt, not even close. The antipode of Iowa is in the
Indian Ocean and hundreds of miles from any land. The straight-line distance
from Iowa to Egypt through the Earth's sphere would be more like 6000 miles.
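
Both distances are easy to check with a few lines of Python. The sketch below computes the surface (great-circle) distance with the haversine formula and the straight-line chord through the sphere. The coordinates (roughly Des Moines and Cairo) and the mean Earth radius are my assumptions, not values from the thread, and the Earth is treated as a perfect sphere.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean radius of a spherical Earth


def central_angle(lat1, lon1, lat2, lon2):
    """Angle in radians between two points, via the haversine formula."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * math.asin(math.sqrt(a))


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Shortest distance along the surface of the sphere."""
    return EARTH_RADIUS_MILES * central_angle(lat1, lon1, lat2, lon2)


def chord_miles(lat1, lon1, lat2, lon2):
    """Straight-line 'tunnel' distance through the sphere."""
    return 2 * EARTH_RADIUS_MILES * math.sin(central_angle(lat1, lon1, lat2, lon2) / 2)


# Assumed coordinates: roughly Des Moines, Iowa and Cairo, Egypt.
IOWA = (41.6, -93.6)
EGYPT = (30.04, 31.24)

surface = great_circle_miles(*IOWA, *EGYPT)
tunnel = chord_miles(*IOWA, *EGYPT)
print(f"surface: {surface:.0f} miles, tunnel: {tunnel:.0f} miles")
```

With these particular inputs the surface distance comes out at roughly 6400 miles and the tunnel at roughly 5700, so the chord can never exceed the diameter and is always shorter than the surface route; other choices of city will shift both figures somewhat.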

------
sedachv
Read Jim Gray's "Why Do Computers Stop and What Can Be Done About It?"
(<http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf>)

Excerpts:

"In the measured period, one out of 132 software faults was a Bohrbug, the
rest were Heisenbugs."

"[retry] routines had a 76% success rate in continuing system execution."

Cosmic rays or race conditions, transient bugs _are_ common.
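
If most faults really are transient, simply trying again is a cheap first line of defense. Here is a minimal Python sketch of a retry wrapper; it is a generic illustration, not the mechanism described in Gray's report, and the `retry` and `Flaky` names and parameters are made up for the example.

```python
import time


def retry(operation, attempts=3, delay_seconds=0.0, transient=(Exception,)):
    """Call operation() until it succeeds or attempts run out.

    Only exceptions listed in `transient` are retried; the last one is
    re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except transient as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_seconds)  # give the transient condition time to clear
    raise last_error


class Flaky:
    """Hypothetical operation that fails a fixed number of times, then works."""

    def __init__(self, failures):
        self.remaining = failures

    def __call__(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise TimeoutError("transient fault")
        return "ok"
```

`retry(Flaky(2), attempts=3)` succeeds on the third call, while `retry(Flaky(5), attempts=3)` gives up and re-raises the `TimeoutError`. A Bohrbug, by definition, would fail every attempt the same way, which is one way to tell the two apart.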

