

Something is erasing my program while it's running (oh wait oops) - jvns
http://jvns.ca/blog/2013/12/16/day-43-hopefully-the-last-day-spent-fixing-linker-problems/

======
jasonkester
Ah, bummer. I was hoping it would end up being something cooler.

Like in the 80's, writing games for the Commodore 64... Where they kept the
memory for the program at a location shortly after where they kept the screen
memory... So if you forgot to bounds-check the coordinates of the little guy
you had running around on the screen, you could run him right off the bottom.

Where he'd start overwriting random bits of memory. Until eventually he walked
right over your source code. Which was being interpreted. Until, of course,
the guy caught up with the current execution point.

At which point, you hoped you had spent 10 minutes saving a copy of your code
recently. Because listing your source you'd find it speckled with random
special characters and changed values. Tire tracks from that little guy.

And you learned that the first thing you do when implementing that joystick
routine is to also implement bounds checking.
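
For anyone who hasn't been bitten by this, here is a rough sketch of the idea in Python. It is a toy model, not the real C64 memory map: the 40x25 screen size, the flat `bytearray`, and all of the names (`make_memory`, `draw_unchecked`, `draw_checked`) are assumptions made up for illustration. The program bytes simply sit right after the screen bytes, so one row too far down lands on your code.

```python
SCREEN_COLS, SCREEN_ROWS = 40, 25          # assumed toy screen size
PROGRAM_BASE = SCREEN_COLS * SCREEN_ROWS   # program stored right after the screen
PROGRAM = b'10 PRINT "HI"'                 # stand-in for the interpreted source
SPRITE = 0x51                              # arbitrary byte used to draw the little guy


def make_memory():
    """Return one flat block: screen cells first, then the program bytes."""
    return bytearray(PROGRAM_BASE) + bytearray(PROGRAM)


def draw_unchecked(memory, x, y):
    """Write the sprite with no bounds check; y >= SCREEN_ROWS hits the program."""
    memory[y * SCREEN_COLS + x] = SPRITE


def clamp(value, low, high):
    """Pin value into the inclusive range [low, high]."""
    return max(low, min(value, high))


def draw_checked(memory, x, y):
    """Clamp the coordinates to the screen before writing."""
    x = clamp(x, 0, SCREEN_COLS - 1)
    y = clamp(y, 0, SCREEN_ROWS - 1)
    memory[y * SCREEN_COLS + x] = SPRITE
```

Calling `draw_unchecked(mem, 0, SCREEN_ROWS)` stomps on the first byte of `PROGRAM`, while `draw_checked` with the same arguments just parks the sprite on the bottom row.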

~~~
jerf
"I was hoping it would end up being something cooler."

It reminded me of this classic GHC error:
[https://twitter.com/bos31337/status/116372971509121025](https://twitter.com/bos31337/status/116372971509121025)

Not quite the same, but still "outright program destruction" of a sort.

------
LeafStorm
Ah, the wonders of low-level assembly programming. I was a teaching assistant
for an assembly course this semester at school, and one of the students was
running a program in the debugger. (Specifically, Microsoft CodeView running
in DOSBox.) F10, F10, F10, until she got to the load instruction that would
fill the cx register. Press F10 again, and suddenly, the debugger shows "cx =
DEAD."

She freaks out a bit, I freak out a lot. We're both worried that something
really bad happened, then I realize that DEAD is actually a number in hex. I
look through the data segment, and I see an AD DE in there (yay Intel byte
reversal!).
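
In case the byte reversal isn't obvious, here is a small Python sketch of it. The two bytes are the ones from the data segment above; the variable names are just for illustration.

```python
import struct

# The two bytes as they sit in memory, lowest address first.
raw = bytes([0xAD, 0xDE])

# Little-endian: the first byte is the low half of the 16-bit word.
value = int.from_bytes(raw, byteorder="little")

# Same read using struct; "<H" means little-endian unsigned 16-bit.
(word,) = struct.unpack("<H", raw)

# Reading the same bytes big-endian gives the "unreversed" number instead.
swapped = int.from_bytes(raw, byteorder="big")

print(f"{value:04X}")    # DEAD
print(f"{swapped:04X}")  # ADDE
```

So a 16-bit load from those two bytes on an Intel machine puts DEAD in the register, even though the dump reads AD DE.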

I check through the code a bit, and even though it looks like the data segment
is being initialized properly, it's still reading from the Program Segment
Prefix. So, I find the Wikipedia article on the PSP and see that address 05h
in the PSP has a cross-segment jump to CP/M compatibility code.

But, there's nothing at address DEAD (at least, nothing sensible). So I go
searching through the DOSBox code for that string, and I find these lines:

    
    
        // far call to interrupt 0x21 - faked for bill & ted
        // lets hope nobody really uses this address
        sSave(sPSP,cpm_entry,RealMake(0xDEAD,0xFFFF));
    

As it turns out, because DOSBox doesn't implement CP/M compatibility, they
simply made it jump to DEAD instead, just to make sure that if anyone tried
running a CP/M program they would get the picture. And the assembler just
happened to put the variable in the right place that the load pulled the DEAD
instead of random nonsense.

I can only imagine the problems that can manifest when you're dealing directly
with hardware, and don't have access to commented source code.

~~~
warp
I guess bill & ted refers to
[http://en.wikipedia.org/wiki/Bill_&_Ted%27s_Excellent_Advent...](http://en.wikipedia.org/wiki/Bill_&_Ted%27s_Excellent_Adventure_%281990_video_game%29)

~~~
maxerickson
In the sequel they go on a Bogus Journey and encounter death.

~~~
chiph
There are rumors that there will be a 3rd movie. Not sure if they'll be able
to find another Rufus - Carlin is irreplaceable.

------
greenyoda
_Allison asks: “What linker debugging strategies do you have?”

Change the linker script randomly (actual thing that has worked)

Change variable attributes from ‘private’ to ‘public’ at random (actual other
thing that has worked)..._

Changing stuff at random (without having any idea why you're doing it and
without keeping track of what you've already done) isn't a very productive
debugging strategy - it's more like an act of desperation. At least use the
scientific method:

- Make a hypothesis about what might happen if you change X.

- Observe and record what actually happens.

- Think about why this could be, and then think about what experiment you
could do next to get another step closer to the answer.

If you have a hard time coming up with a reasonable hypothesis, it might help
to learn more about the system. In this case, reading the linker documentation
to find out what the various options do could be more productive than just
making random changes to the linker script.

~~~
pja
_In this case, reading the linker documentation to find out what the various
options do could be more productive than just making random changes to the
linker script._

If you haven't been faced with a mystifying bug where randomly futzing with
things in the hope that you would find a change that had a reliable effect was
the best hope you had of solving it then you probably haven't been programming
long enough.

~~~
greenyoda
Haven't been programming long enough? If you're the average age on HN, I may
have been programming before you were born. :)

And I've dealt with lots of mystifying bugs. But my approach to debugging
isn't trying random things, it's gathering more information in deliberate
ways: reading the code, tracing code with the debugger, adding logging
information if the problem doesn't reproduce under the debugger, running test
cases that attempt to narrow down the scope of the bug, etc.

I've even debugged customer problems that I couldn't reproduce locally by
looking at the code and deducing what execution path had to have occurred to
account for the reported behavior. That's one situation in which you
absolutely can't try random things to see if they work.

If I were truly making random changes, it would suggest to me that I had no
understanding of how the system worked. If you do understand the system and
its potential failure modes, then you can do better than random.

~~~
pja
In this particular case the kernel binary loader was dropping half the binary
on the floor, so code would sometimes work and sometimes not in a way which
depended on both the environment and the layout of the compiled binary. I'm
not surprised the author ended up randomly futzing with things to see if they
could find a reliable cause! (This error could have come straight out of the
;login: logout essay that was linked on HN a week or two ago: "I have no tools
because I broke my tools with my tools!") Personally I think the reason she
was emphasising that changing these things made a difference was because it
was so surprising.

Now in your infinite and sagacious wisdom you've told us all how she should
_obviously_ have been able to debug this problem in a trice using your
superior methods. Me, I'm happy to cut her some slack, and recognise that
systems development can be a complete nightmare!

------
nl
There was a bug in the (Windows, 1.3.something) JVM around 2003 where if you
had more than 1024 files open (e.g. by initializing the Velocity templating
engine on each request, as a random example) then it would _delete any
subsequent files it opened_.

Needless to say that was a bugger to debug. We didn't believe what was
happening until we literally saw a jar disappear before our eyes.

Sun fixed the problem pretty quickly!

------
MattBearman
I once made a somewhat similar error. Many years ago, when I was just learning
PHP, I attempted to write a function to delete all the contents of a
directory, including subdirectories.

I don't remember the exact code, but it was something like:

    
    
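      function clear_dir($dir) {
        $contents = scandir($dir); // returns array of entry names in $dir

        // NOTE: "." and ".." are not filtered out here; that is the bug.
        foreach ($contents as $file) {
          $path = $dir . '/' . $file; // scandir returns bare names, not paths
          if (is_dir($path)) {
            clear_dir($path);
          }
          else {
            unlink($path);
          }
        }
      }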
    

_The function names may not all be correct, it's been a while since I PHP'd_

Anyway, the code looks like it should work. The problem is that the scandir
function's result includes "." and "..".

So my little function would see "..", recognise it as a directory, and
recursively call itself on the parent of the directory I wanted to empty. I
lost a fair few files before I realised what was happening. Luckily I had
backups, as this was before I started using SCMs.
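
For comparison, here is a rough sketch of a safer version in Python rather than PHP. It is not the original code. Python's `os.listdir` already leaves out "." and "..", so the explicit skip is only there to make the guard visible; the symlink check and the names `clear_dir` and `SPECIAL_ENTRIES` are my own assumptions for illustration.

```python
import os
import tempfile

SPECIAL_ENTRIES = {".", ".."}  # entries that must never be recursed into


def clear_dir(path):
    """Delete everything inside path, leaving path itself in place."""
    for name in os.listdir(path):
        if name in SPECIAL_ENTRIES:
            continue  # defensive; os.listdir does not return these
        full = os.path.join(path, name)
        if os.path.isdir(full) and not os.path.islink(full):
            clear_dir(full)   # empty the real subdirectory first
            os.rmdir(full)    # then remove the now-empty directory
        else:
            os.remove(full)   # regular file, or a symlink (link only)


# Usage: build a throwaway tree, clear one directory, check its sibling.
root = tempfile.mkdtemp()
keep = os.path.join(root, "keep.txt")
target = os.path.join(root, "target")
os.makedirs(os.path.join(target, "sub"))
for p in (keep, os.path.join(target, "a.txt"),
          os.path.join(target, "sub", "b.txt")):
    with open(p, "w") as fh:
        fh.write("x")

clear_dir(target)
print(os.listdir(target))    # []
print(os.path.exists(keep))  # True
```

Trying it on a temp directory first, as above, is a cheap way to find out where it wanders before pointing it at anything you care about.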

------
sbuccini
Is this project typical of what people are working on at Hacker School?

------
cognivore
Reminds me of one of my favorite work conversations: "Hey Matt, you have any
self deleting code?" "I used to."

------
LeafStorm
Also, this post jumped from #20 on the front page to somewhere below #360 in
the time it took me to write my other comment.

~~~
shubhamjain
Stupid question probably, but how did you know its ranking had dropped to
#360? Is there a feature of HN I am missing?

~~~
anon1385
This site will give you a graph:
[http://hnrankings.info/6919666/](http://hnrankings.info/6919666/)

