
What is a coder's worst nightmare? (2014) - oskarth
https://www.quora.com/What-is-a-coders-worst-nightmare/answer/Mick-Stute?srid=RBKZ&share=1
======
musesum
"Best Fit" memory management. Malloc three buffers, in order: 64K, 1K, 64K.
Now free all three. Next allocate 1K. Guess where it resides in the heap? Yep:
the middle! After a while, memory was fragmented with 1K detritus. Even though
you were supposed to have 500K available, there wasn't room for a 64K buffer
anymore. This was the default for Microsoft's C compiler, in 1990. How did I
find out?

There was a project that was crashing into a blue screen of death after a few
hours. The crashing library had #defines that went 15 levels deep. The stack
dump wasn't very instructive. Even though the library ISV was based in Boston,
I had to fly out to Santa Cruz to talk to the developer directly. He was
somewhat belligerent: "You should have figured it out yourself!" 15 levels of
#defines. He was the perfect example of the other side of Dunning-Kruger,
thinking his arcane shit was obvious.

But his fix didn't solve the problem. So, I flew back to Boston. I started to
trace everything with -ahem- #defines of the main APIs. It was turning into a
Zen koan: the sound of one app crashing. Everything worked with the Borland
compiler. It was just the MSC compiler that failed. The company was one of the
top 3 ISVs back then. Meetings about what to do would last for hours. There
were technical politics: "Can we use Borland?" No; the company had
standardized on MS. I could talk to their engineers directly, though. But they
didn't have a clue. So, I read through the docs to see what compiler switches
were available. And there it was: best-fit. Fixed!

So, I wrote an analysis about why best-fit was absolutely the worst possible
memory management and sent it to Microsoft. It took 8 years for MS to change
the default. So, the reason why software would blow up? Best fit. Why would
early versions of Windows go into blue-screen-of-death? Best fit. How many
developers had to deal with this? Worst nightmare in aggregate.
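
Out of curiosity, the pattern is easy to sketch (illustrative C++; modern
allocators aren't pure best-fit, so the addresses you see today will differ):

    #include <cstdio>
    #include <cstdlib>

    // Reproduce the allocation pattern from the story. Under a best-fit
    // allocator, the freed 1K hole in the middle is the "best" (tightest)
    // home for the next 1K request, so small blocks keep landing between
    // what were large contiguous regions, until no 64K hole survives.
    int main() {
        void *a = std::malloc(64 * 1024);   // 64K
        void *b = std::malloc(1024);        // 1K
        void *c = std::malloc(64 * 1024);   // 64K
        void *mid = b;                      // where the 1K block sat

        std::free(a); std::free(b); std::free(c);
        // Heap now: 64K hole, 1K hole, 64K hole.

        void *d = std::malloc(1024);        // best fit picks the 1K hole
        std::printf("old 1K slot %p, new 1K block %p\n", mid, d);
        std::free(d);
        return 0;
    }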

~~~
davidst
You just brought back old memories for me. I wrote the heap manager for
Borland C. I chose the next-fit algorithm for its balance of speed and
reasonable fragmentation under common use cases. Best-fit performs exactly as
you described.

~~~
dalke
I pulled out my old Minix (1987) book, which I keep for nostalgia. Tanenbaum
on p. 202 writes:

> The simplest algorithm is first fit... A minor variation of first fit is
> next fit. ... Simulations by Bays (1977) show that next fit gives slightly
> worse performance than first fit. ... Another well-known algorithm is best
> fit. ... Best fit is slower than first fit.... Somewhat surprisingly, it
> also results in more wasted memory than first fit or next fit because it
> tends to fill up memory with tiny, useless holes. First fit generates larger
> holes on the average. ... one could think about worst fit... Simulation has
> shown that worst fit is not a very good idea. ...

> quick fit ... has the same disadvantage as all schemes that sort by hole
> size, namely, when a process terminates or is swapped out, finding its
> neighbors to see if a merge is possible is expensive. If merging is not
> done, memory will quickly fragment into a large number of small, useless
> holes.
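
The policies differ only in which hole the free-list scan settles on. A rough
C++ sketch, ignoring the block splitting and coalescing a real allocator does:

    #include <cstddef>

    struct Hole { std::size_t size; Hole *next; };

    // First fit: take the first hole big enough.
    Hole *first_fit(Hole *list, std::size_t want) {
        for (Hole *h = list; h; h = h->next)
            if (h->size >= want) return h;
        return nullptr;
    }

    // Best fit: scan the whole list for the tightest hole big enough.
    // The leftover sliver is as small as possible -- which is exactly
    // why memory fills up with tiny, useless holes.
    Hole *best_fit(Hole *list, std::size_t want) {
        Hole *best = nullptr;
        for (Hole *h = list; h; h = h->next)
            if (h->size >= want && (!best || h->size < best->size))
                best = h;
        return best;
    }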

------
av9
Not having a job is my worst nightmare. Realizing your skills are being
undercut by a fresh college grad who can get enough pats on the back while
skidding along with code that works. Realizing the language you cradled has
now become outdated, and the other engineers who have picked up the next 9
years of hotness are too good for you to compete with. Having to interview for
a new job and realizing that you can't pass the whiteboard questions because
you have fallen behind the expectations of young blood. That's my nightmare:
becoming irrelevant.

~~~
sbuttgereit
As much as I hate to say it: it's a healthy nightmare. I have it too, though I
came to development late in my career (my career was more about managing
technologists in IT departments, but I turned out to be good with Oracle and
PostgreSQL and other development tasks and now get more requests for that...
which pays as well and is more fun to boot).

When I was young, I would sometimes go into interviews where the other
candidates were there in the lobby, too. Some were older and had clearly been
in technology a long time... but, as you made small talk, it was clear they had
never advanced past their "prime". They weren't going to get the job, nor
should they have. They allowed themselves to become as relevant as token ring.

As technologists we have to keep ourselves current. That keeping up with the
kids is really part of our profession: the technology evolves quickly and so
must we. But this is the part I like about technology work. There's always
something next to keep things fresh. A new technology, a new architectural
style, new needs, unsolvable problems that become solvable. Our profession
doesn't tolerate coasting. Good.

Other professions have this same problem, just not at our speed. The
proverbial buggy whip makers had to evolve with changing realities... they
just had more time to see the inevitable.

(As an aside, I play poker with a dedicated COBOL developer in his mid-50s...
so sometimes you can make old school pay, too :-) ).

~~~
gaius
It's nightmarish because you can't control it - software is driven by fashion.
You can carefully choose a good technology, patiently master it, then find
there's no demand anymore because a technology that's worse by every objective
measure is getting all the blog posts and retweets this year, and all the
people into it are reinventing the wheel you knew 20 or 30 years ago, all of
them claiming to be innovating and inventing new stuff. Like the people who
knew Smalltalk in the 90s and were poised to take software engineering to the
next level but got stomped on by Java, say. Or the Perl people who were
_finally_ getting their act together with Modern Perl before they were stabbed
in the back by the Perl 6 camp. There are dozens of examples. Sometimes it
comes back around, but if you knew ML in the 70s there was an _awfully_ long
wait for F# to go semi-mainstream...

~~~
SomeCallMeTim
I have been learning new technologies pretty much continuously. It's not
impossible, especially if you follow sites like Hacker News, to keep a finger
on the direction of the industry, and then try to stay on top of the next new
hot technology of the year.

But I hear you on the "worse" technology sometimes winning. You mentioned
Java; it was worse than just about all other major contenders, and is only
finally losing popularity.

On a current technology fad: React seems to be designed to ignore 40 years of
accumulated software best practices. [1] Separation of concerns? Who needs
that any more? And the rationale for it is that it allows _teams of 100
developers to work together on an app._ Open standards? Nah, how about lock-in
to custom language extensions that will prevent you from migrating your code
to the _next_ web standard! Much better.

And how many app teams have 100 or more active developers? Probably fewer than
a dozen, and I submit that _none_ of them probably should. Certainly not the
Facebook app: it has a lot of features, but not _that_ many features, and yet
it has a 150MB footprint. When I hear things like that, I can't help but fill
in "junior" or "mediocre" in front of "developers." React helps to prevent
people from breaking each other's code when you have bloated development teams
filled with junior developers. React has some cool ideas, but all told I think
it's a step backward for software engineering, and it certainly isn't as much
of a help for small teams, especially if you want to have a CSS/SCSS/LESS
expert styling your product without having to dig through JSX files, for
instance.

The Java rationale was similar, IMO: you can use the Java abstractions to
allow lots of mediocre developers to make contributions to a product without
breaking each other's code. At least not as frequently as when they can poke
right into random data structures and change things arbitrarily. If it weren't
for Google's decision to require Java for Android, I think Java would be
relegated to big company web backend development.

I do like React's idea of the Virtual DOM for optimization, but you can get
that without using React. [2] React Native is great for using native
components and driving them from JavaScript, but it's also not the only game
in town. [3]

Back to the original point, though: You can stay on top of the Hot New
Technologies, but when there are good technical reasons to use alternate
technologies, stay on top of those as well. And then explain clearly to your
clients (or employers) why the current fad is a fad, and how to get the key
benefits of that stack without its drawbacks. Oh, and choose clients (or
employers) who will listen to strong technical arguments. :)

[1] https://www.pandastrike.com/posts/20150311-react-bad-idea

[2] https://github.com/Matt-Esch/virtual-dom

[3] https://www.nativescript.org/

~~~
mercurial
> On a current technology fad: React seems to be designed to ignore 40 years
> of accumulated software best practices. [1] Separation of concerns? Who
> needs that any more? And the rationale for it is that it allows teams of 100
> developers to work together on an app. Open standards? Nah, how about
> lock-in to custom language extensions that will prevent you from migrating
> your code to the next web standard! Much better.

That's some strawman right here. First, JSX is optional. Secondly, JSX is
open. You may disagree about whether it's a standard or not, but you know
what? If tomorrow you end up with a large codebase where you want to get rid
of JSX, you just apply a JSX transpiler to your existing codebase. Problem
solved. As for separation of concerns, you have a little bit more of a case.
React does allow you to put business logic in your views... just like 99% of
templating languages out there. But React does not _force_ you to do that. You
can have only the minimum amount of logic you need, and put most of your
frontend business logic in your store.

~~~
SomeCallMeTim
It's not just about the JSX part. That's a piece of the larger problem: the
fact that the HTML & CSS are being built up in JavaScript at all. This
troubles me at a deep level, and no matter the approach you use, it likely
won't be portable to another framework.

Templates using a "handlebars" syntax are pretty portable with minor tweaks,
in general.

~~~
spion
This happened because CSS and HTML are a totally broken model, useless for
writing larger applications. They don't provide any sort of modularity or
encapsulation - everything is in the open, inside one big giant namespace,
with class names colliding with other class names and styles cascading
through / being inherited by everything, whether you want it or not.

Web Components / HTML Imports kind of solve this, but they're still not there.
http://caniuse.com/#search=components

JavaScript has lexical scope, and that pretty much solves everything. CSS
styles are now local variables and can be passed around. Components are not
just a dump of HTML, but opaque functions - their inner structure cannot be
(accidentally) accessed by outside components. Ah - finally, a sensible
programming model.

Just imagine what would be possible if CSS class names were lexical, not
global strings - and you could import them into HTML (perhaps with renaming).
How big a change that would be in terms of making things easier to separate.

Well... React users got tired of imagining things :)
http://glenmaddern.com/articles/css-modules

~~~
spion
Of course Web Components solve it by reinventing _all_ the possible wheels.
Tim forbid that we get standard lexical scope and parameter passing.

------
SilasX
This is funny, but it's actually more like a programmer's _second_ worst
nightmare -- "It doesn't work but it should".

The true worst nightmare is, "it works, but it shouldn't", because that means
your whole model of the domain was wrong, rather than some isolated, fixable
component.

~~~
forrestthewoods
1. That can't happen.

2. That doesn't happen on my machine.

3. That shouldn't happen.

4. Why does that happen?

5. Oh, I see.

6. How did that ever work?

Stage six is the worst. Especially when I wrote the original code.

(http://plasmasturm.org/log/6debug/)

~~~
hitekker
Yep, that's a good one! My first time seeing it was on
http://bash.org/?950581

It comes with a nice follow-up after the first 6 rules:

< MatthewWilkes> 7. svn blame

< miniwark> 8. one day we will write tests

~~~
golergka
> < MatthewWilkes> 7. svn blame

Am I the only one who runs git blame (for the commit, not the author) 20
minutes into investigating a bug, and git bisect an hour in? They are
excellent tools for finding when and how a bug first occurred.

~~~
klibertp
In the "Beautiful Code" book, there is a "Beautiful Debugging" chapter. The
author describes how things like `git bisect` work. I had some success
convincing people to use bisect at work by pointing them to this chapter.
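
For anyone who hasn't tried it, the workflow is short enough to memorize (the
good tag here is hypothetical):

    git bisect start
    git bisect bad HEAD      # this commit exhibits the bug
    git bisect good v1.2     # a commit you know was fine
    # git checks out the midpoint; build, test, then mark it with
    # "git bisect good" or "git bisect bad", and repeat until git
    # names the first bad commit. Then:
    git bisect reset

And if you have a test script, `git bisect run ./test.sh` automates the whole
binary search.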

------
joenathan
I'm not a programmer but a Sysadmin. I ran into an issue a few months ago that
nearly drove me crazy. I was custom building a new server for a customer, very
nice build, Xeon E5 10 Core, 64GB DDR ECC RAM, Windows Server 2012 R2, SSDs,
etc...

Everything is going well: update and configure the BIOS, install Windows,
install drivers and software. Then I start configuring the server, the server
needs to reboot, and it reboots to a blue screen of death. Can't get it to
boot up normally. OK, must be bad software, a driver, etc... Time for a clean
install; everything is going fine, and then reboot -> blue screen. Look up the
bug check code, no help. Another clean install, this time no extra software,
no drivers, same problem after a reboot. Finally I figure out that it is only
happening after I make the computer a domain controller. After Googling with
this new information, I find one thread on some forum where people are having
the same problem. Turns out it was the solid-state drive: if you use a Samsung
850 Evo SSD with Windows Server 2012, it will blue screen when you make it a
domain controller. I never thought a problem like this was possible. Sure
enough, I changed the installation drive and no more problems. Nearly drove me
crazy; took two days of troubleshooting.

~~~
vl
Haha, these multi-hundred-thousand-line-source-code SSD firmwares need to do
something. In particular, they detect popular filesystems and try to optimize
some operations (specifically, if the filesystem is recoverable by chkdsk
after a crash, an optimization is considered acceptable; e.g. it's possible to
delay the actual commit of free-block info). Seems like the detection glitched
on the blocks that contained data related to being a domain controller.

~~~
digi_owl
Thinking about SSD and HDD firmware can make a guy reach for tape drives as a
last-ditch hope that data stays just data.

At least with those (and to some degree optical media, though the write
process there is more laborious) the storage and the R/W logic are separate
pieces.

------
rdtsc
Other coders' nightmares:

* Security: customers' accounts leaked, identities stolen, bank accounts depleted.

* Undetected data corruption: customers' data is corrupted and it has seeped into the backup stream, even the off-site one, and that is corrupted as well now.

* Rarely occurring, seemingly random concurrency-related bugs that manifest themselves once a month or even less often.

* Time: making assumptions about it, dealing with it in distributed systems, etc. Nobody thinks about it until it stops working or the system's time gets snapped back somehow.

~~~
Splines
* Works on my machine: You can't repro a customer bug in-house. _Something_ about your internal setup is making a difference. Even worse: You work for a big-corp that is a platform for said customers, and so god knows what another team has running in your environment in the name of "dogfood all the things"

~~~
lotharbot
Years ago, my wife had a bug that only happened in a non-debug build, and only
on one specific system. It was a recursive tree traversal algorithm, written
in C, with an OB1 error in some pointer arithmetic. If memory was allocated in
a particular order, the function would read off the end of level X's data and
find level X+1's data, which was structured exactly as expected, so it would
end up processing certain nodes twice. In the debug build, a debug variable
initialized to null sat between the two blocks of data and therefore got
interpreted as a terminator.
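
A hypothetical reconstruction of the shape of that bug, in C++ (not her actual
code):

    #include <cstdio>

    struct Node { int tag; int payload; };

    void process(Node *n) { std::printf("node %d\n", n->tag); }

    // The end pointer is computed one element too far, so the scan reads
    // past level X's block. In the release build, the next bytes in memory
    // could be level X+1's block -- perfectly valid-looking Nodes, so those
    // nodes got processed twice. In the debug build, a zero-initialized
    // debug variable sat between the two blocks, so the stray read saw
    // tag == 0 and stopped, hiding the bug.
    void walk_level(Node *level, int count) {
        Node *end = level + count + 1;   // off-by-one: should be level + count
        for (Node *n = level; n != end && n->tag != 0; ++n)
            process(n);                  // the extra slot reads whatever is next
    }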

~~~
hacknat
What a horrible coincidence. I'm surprised they found it.

------
dalke
That seems more like a fictionalization of "Reflections on Trusting Trust"
than a true story.

~~~
joelbugarini
Mick (OP) responds to this assumption in the Quora comments:

_I just read that. This guy certainly wasn't Ken Thompson. But this happened
to me in 1991, seven years after the date of publication you posted, and the
"login" portion did the exact same thing Ken was describing as far as the
secret password. This was probably Thompson's genius idea implemented by the
grad student. The mechanism in the compiler was very different, though, but
the outcome was still a poisoned compiler.

I had heard of self-modifying compilers before, too, or I probably wouldn't
have thought to look there myself. I'm not surprised to see Thompson behind
the original idea._

~~~
keithpeter
I thought of the 'trusting trust' paper just as I reached the paragraph about
the compiler.

As an ex-PHB rather than a programmer, I think perhaps Dr Phelps could have
taken a bit more interest in the project &c. Perhaps had a look at the code
now and again and simply asked, "why does this look so unlike other people's
code?" Might have sent a shot across the bows of the bad apple.

~~~
Drdrdrq
This. I wonder what made the graduate angry enough to go to such extreme
lengths to protect his code. And why nobody noticed it - did they even have
the right to fix the code?

------
userbinator
That's not a worst nightmare for someone who knows how to read assembly
language and use a debugger. I would've narrowed that down to the compiler the
first time and probably debugged into it too (providing whoever did this
_very_ clever hack didn't also do something to the debugger... but if the
system is behaving this oddly, it's certainly better to bring one's own
tools.)

Seeing the CPU do something it shouldn't, being traced by a hardware ICE, now
_that_ is a worst nightmare - and one I've actually experienced.

~~~
mcherm
Story time? (Please!)

~~~
userbinator
It was a long time ago, but I remember we were working on an embedded system
controlling some industrial equipment, and it randomly crashed; the time
between crashes was long enough that it'd take several days before it
happened, so even getting a trace of the crash was an exercise in patience.
Eventually we did get a trace, and it turned out the CPU would suddenly start
fetching instructions and executing from a completely unexpected address,
despite no interrupts or other things that might cause it. We collected
several more traces (took around a month, because of its rarity) and the
addresses at which it occurred, and the address it jumped to, were different
every time. Replacing the CPU with a new one didn't fix it, and looking at the
bus signals with an oscilloscope showed nothing unusual - everything was
within spec. We asked the manufacturer and they were just as mystified and
said it couldn't happen, so we resorted to implementing a workaround that
involved resetting the system daily. Around a year after that, the CPU
manufacturer released a new revision, and one of the errata it fixed was
something like "certain sequences of instructions may rarely result in sudden
arbitrary control transfer" \- so we replaced the CPU with the new revision,
and the problem disappeared. We never did find out what exactly was wrong with
the first revision, other than the fact that it was a silicon bug.

~~~
pdkl95
> the time between crashes was long

I know _that_ pain. I was working on the NDIS driver (WinNT 3.51, later a 4.0
beta) for our HIPPI[1] PCI cards. The hardware was based on our very reliable
SBus cards, so when the PCI device started crashing, we assumed it must be a
software error.

I probably spent 2+ months trying to find the cause of the crash. Trying to
decide if your change had any effect at all when you have to wait anywhere
from 5 minutes to >10 hours for the crash to happen will drive you insane. You
have to fight the urge to "read the tea leaves"; you will see what you want to
see if you aren't careful.

While I never did find the problem, I did discover that MSVC was dropping
"while (1) {...}" loops when they were in a macro, but compiled them correctly
when the macros were changed to "for (;;) {...}".
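
The pattern looked roughly like this (illustrative; poll_hw is a stand-in for
the real per-iteration work):

    void poll_hw() { /* stand-in for the driver's real per-iteration work */ }

    // Expanded from macros, the first form was silently dropped by that
    // era's MSVC; the second compiled correctly.
    #define SPIN_BAD()  while (1) { poll_hw(); }
    #define SPIN_GOOD() for (;;)  { poll_hw(); }

    void service_loop() {
        SPIN_GOOD();   // with SPIN_BAD() the loop vanished from the binary
    }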

Later, a hardware engineer took the time to capture the entire PCI interface
in a logic analyzer, hoping to randomly catch what happened just before the
crash. After another month+ of testing, it worked. He discovered
that the PCI interface chip we were using (AMCC) had a bug. If the PCI bus
master _deasserted_ the GNT# pin in exactly the same clock cycle that the card
_asserted_ the REQ# pin, the chip wouldn't notice that it lost the bus. The
card would continue to drive the bus instead of switching to tri-state, and
everything crashed.

Every read or write to the card was rolling 33MHz dice. Collisions were
unlikely, but with enough tries the crash was inevitable.

[1] https://en.wikipedia.org/wiki/HIPPI

~~~
jacquesm
Ok, that one should get the prize. The chances of spotting that are insanely
small, kudos on making progress at all, more kudos for eventually tracing it
down to the root. I really hope I'll never have anything that nasty on my
path.

~~~
pdkl95
Most of the credit goes to the hardware guys that were able to finally isolate
the problem.

We found the bug, but the months of delay (and a few other problems like
losing a big contract[1]) killed the startup a few months later. While I'm
annoyed the SCSI3-over-{HIPPI,ATM,FDDI} switch I got to work on was never
finished, the next job doing embedded Z80 stuff was a lot more fun... and a
LOT easier to debug.

Incidentally, I found a picture[2] of the NIC. Note that HIPPI is simplex -
you needed two cards and two 50-pin cables. This made the drivers extra "fun".

[1] "no, I'm not going to smuggle a bag of EEPROMs with the latest firmware
through Malaysian customs in my carry-on" (still hadn't found the bug at the
time)

[2] https://hsi.web.cern.ch/HSI/hippi/procintf/pcihippi/pcihipso.htm

~~~
jacquesm
That's a beautiful board. I remember the FAMP made by Herzberger & Co at
NIKHEF-H, I used to hang around their hardware labs when those were built.
Similar hairy debug sessions in progress. Those worked well in the end iirc
and ended up at CERN.

------
hashkb
_I was hired by a psychologist_

My money was on the psychologist the whole time. I still kind of think it was
Dr. Phelps and maybe Mark the admin and the AT&T tech are grad students in
disguise. So I guess my worst nightmare is finding myself in a similar
situation and later finding out my boss set the whole thing up for funzies.

~~~
emcrazyone
I was hired by a bookie ;)

I have a somewhat related story. Circa 1995 or so, I was working on software
compiled on a 486DX computer. Back then, computers had L2 cache in SRAM chips
you plugged into the motherboard.

I was moonlighting with a few guys to develop software, and each night my two
friends and I met up in an office we were renting. There, on one of those
fold-out tables, we toiled away each night writing code.

The code we worked on was for offshore gambling (aka a bookie). Back in the
90s you could walk into a convenience store and pick up a magazine, usually an
Auto Trader magazine. On the back was an 800 number you could call to place
bets on football, hockey, or baseball. (Called HOBs bets in the industry.)

Anyway, because of the nature of the software and the people we dealt with,
things were a bit hairy at times. The bookie we worked for paid us generously
but, at the same time, expected perfection (i.e. software that just worked).

One evening, as usual, the three of us met to work toward the next release of
the software. The next release was to include a feature for boxing bets. Mike
Tyson was about to be released from jail, and the bookie was anticipating
plenty of future business taking boxing bets.

So this one evening we were doing a software build to release as beta
software. I did the build and ran the battery of tests we normally run, and
all worked fine. To transfer the software to the offshore network we used a
dial-up connection.

I pushed the software out, which my bookie contact would run to make sure all
the logic was correct.

A few days go by and we get a phone call that the software is not functioning.
Sometimes it crashed (which was rare), or just strange things would happen,
such as blinking characters on the screen.

I collected enough information to try to replicate the necessary steps. You
can imagine my customer wasn't too happy. So the three of us basically stopped
working on the code to track down the bug. We spent a week looking for the bug
we were sure was in the code somewhere. Back then we didn't really use a code
repository, but we did have a diff of the beta release against the previous
release.

Poring over the code changes, we just could not find the problem. It even got
to the point where I would recompile the beta version, run it, and actually
get the code to crash, while the earlier release would not crash.

It was by chance that my fellow coder, sitting right next to me, started
running similar tests on his machine while I was off making a food run. It was
sort of normal for us to group up around my workstation when something needed
a collective look.

Long story short, it turned out that one of the L2 cache chips was corrupting
my compiles in just the right way to make the compiled code not match what we
had typed in.

Anyway, thought I'd share. Although not a malicious act, it was one of the
worst things to track down. Fast forward to my current life... I work in the
embedded field and recently solved a bit-flip problem in one of the products
my company produces that relies on NAND memory. That little incident from my
past certainly has served me.... ;)

~~~
perlgeek
I'm curious, how did you manage to pin it down to an L2 cache in particular?
(As opposed to general memory corruption, for example).

~~~
emcrazyone
Ah, good question. It was very, very difficult to pin down. We knew our
compiler was the same on both computers and the compiler flags were all the
same. The two machines were purchased at the same time and we knew everything
about them was the same.

We used something called a Pharlap DOS extender so our software could use
memory beyond the 1MB boundary. It was in fiddling with that memory extender
that I began to suspect a memory failure. Changing its parameters eventually
got me to a fairly repeatable way to make the issue show up, and not only in
my app: 3rd-party software began failing too. Also, while we mostly did our
work from a DOS prompt and a DOS editor (called qc), we would sometimes run
Windows (Windows 3.11), which had its own way of accessing expanded vs.
extended memory through a DOS driver. Swapping between the Pharlap driver and
the Microsoft config.sys driver strengthened my suspicion of a memory failure.
In other words, the machine became noticeably more unstable.

I don't recall the exact name, but I began running something like memcheck on
the computers - basically a memory checker - and although it did not reveal
the problem in its entirety, the memchecker would crash on my machine and not
on any of the others. These were computers running just DOS with Novell
Netware network drivers for DOS/Netware only.

I reasoned that memory was failing and that the memory walker/checker itself
was getting corrupted. When I swapped main memory with another computer and
the problem didn't follow the memory, my only other option was to pull out the
L2 cache chips. When I put them in the computer next to mine and finally saw
the problem show up there, I knew.

My friend owned the computer company that sold us the computers and he later
was able to test and validate my findings.

------
devonkim
Aside from the issues where all your assumptions about your toolchain, OS, and
computational reality are melting around you, this seems closely related to a
lack of control and information. You have to keep calling people from vendors
and other departments because your talent, knowledge, and skills hit a brick
wall in the face of bureaucracy / compartmentalization. You don't know what to
trust anymore, you have to be resourceful with your time, and you have limited
access to the information and control you need.

This is how I explain why my technical job is difficult in a typical
enterprise organization and why you need to be far better than average to
produce merely mediocre results.

------
logicallee
Here's a coder's worst nightmare: "I managed to get reflection working* in
C++, by registering class functions in static variables. This lets you
configure routes like in Ruby on Rails. Since C++ is so much faster, can you
port the rest of Rails over as well? You'll just be doing this for the next
4-5 months."
* https://news.ycombinator.com/item?id=10607029
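
The usual shape of that trick, roughly (names hypothetical; a sketch, not the
code from the linked thread):

    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>

    // Route table, lazily constructed to dodge static-init order issues.
    std::map<std::string, std::function<std::string()>> &routes() {
        static std::map<std::string, std::function<std::string()>> r;
        return r;
    }

    // Constructing a static RouteRegistrar registers a handler before
    // main() runs -- the "reflection" part.
    struct RouteRegistrar {
        RouteRegistrar(const std::string &path, std::function<std::string()> fn) {
            routes()[path] = std::move(fn);
        }
    };

    static RouteRegistrar users_route("/users",
        [] { return std::string("user list"); });

    int main() {
        std::cout << routes()["/users"]() << "\n";   // dispatch by string
    }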

------
mbrock
I worked on software that encrypted entries with a custom C++ plugin to SQL
Server. Nobody knew how it worked, it was implemented long ago and considered
magic. Some entries on some installations sometimes would become totally
garbled, which was a seriously dangerous issue.

This was a pure Java shop and I was almost the only engineer who even knew C++
since the reorganization that happened before I joined.

Turns out the affected installations all used custom compiled DLLs with hard
coded specific crypto keys made up by someone long ago. We didn't really know
how to deploy new versions of this.

But I found some version of the code at least, on some old file server, so I
could sort of debug it. It took me a while to figure out but after working on
it on and off for a few weeks I found a crucial race condition in the code
path that looked up the crypto key.

Turns out if you were unlucky after a SQL Server reboot, a few entries could
get encrypted with a "default" key. So I was able to decrypt the garbled
entries on affected customers at least.

We decided to change the whole method of encryption since it was obviously
hacked together by some cowboy coder a decade ago.

But in the meantime, the easiest way to fix the issue for the affected
customers was a method I taught to one of the tech support guys.

I told him to RDP into the affected server, open the crypto DLL in Notepad++
and do a string substitution on a particular ASCII sequence that would change
a certain character into a tilde. And then reload the DLL with a SQL Server
command.

Because through disassembling the binary I found that this would sufficiently
increase a certain value and fix the bogus comparison that caused the race
condition.

The nightmare here was mostly because of custom DLLs running semi-unknown code
on remote servers.

------
OSButler
As a contractor, the fear is more on the client side than with the code.
Coding problems can be resolved or worked around eventually, but the client
can be a wildcard. In developer meet-ups I usually hear about coding issues as
if they were puzzles whose solutions are being shared with the group. But the
scary stories are usually centered around clients, since problems can arise
there even if you provide perfect code.

~~~
ju-st
When I read the topic my first thought was "customers!"

------
macspoofing
Make a technology, platform, architecture or framework choice. After a
significant amount of effort realize you made the wrong choice.

~~~
collyw
Worse when you make the right choice but everyone else flocks to the crappy
choice.

------
DrScump
Management that doesn't care about regressions, then later sticks you with
placating the customers who get screwed by those regressions.

~~~
NegativeK
Management that does care about regressions and you both still get stuck with
placating the customer, because the customer always wanted features instead of
dealing with debt.

------
DblPlusUngood
Here's another story: I was optimizing the performance of a small, in-house
x86_64 OS on a specific benchmark. I spent many months finding and
removing/improving bottlenecks.

Once the performance was similar to competing OSes, the measurements from my
performance experiments were varying by 30%! This was surprising because the
benchmark was simple and the performance of competing OSes was completely
stable.

Well, after three or four days of frustration, I discovered that the problem
was our test hardware: for whatever reason, _immediately after BIOS POST, any
memory loads executed for the next 10 seconds were 30% slower than memory
loads executed once 10 seconds had passed since BIOS POST_. Initially I was
shocked, but this may make sense -- it seems plausible that hardware
manufacturers run initialization code in the background, resulting in quicker
boot times.

Anyway, after we figured that out we simply waited 10 seconds after POST
before we ran any experiments. I'm curious how common this
slow-for-the-first-10-seconds thing is with other x86_64 machines.

Edited for clarity.

------
thathoo
Apart from the obvious concurrency-based or not-easily-reproducible issues,
some of my personal nightmare scenarios:

1. Uncommented code, but especially uncommented regular expressions.

2. Unnecessary abstraction that runs too deep and so is hard to comprehend.

3. Take any tractable problem and add timezones to it. Now you have an
intractable problem!

------
hacknat
My team and I are in the middle of a bad one right now. Our site is allocating
incredible amounts of memory per page request, and it took us forever to
figure out we didn't have a memory leak, at least in the traditional sense.
We're running .NET, and its GC has three generations of objects: 0, 1, and 2.
Generations 0 and 1 are for short-lived objects, but after an object has been
around long enough, the GC assumes it will probably be around forever, and the
gen-2 collector gets almost no time to run. We proved this by forcing the
gen-2 collector to run longer, which brings the memory under control, so we
don't have a "leak", per se, but we still can't figure out why these ephemeral
objects are sticking around long enough to get promoted that far.

~~~
grandinj
Increase the size of your Eden space. You probably have something allocating
very large amounts of temp objects, which is exhausting Eden. You should
optimize that, whatever it is, but bumping up Eden will give you some
breathing room.

------
melted
Leading a project where in order to just not fall behind on maintenance and
user support you need 2 people, and 2 people is all you're gonna get for the
foreseeable future, yet management demands forward progress out of you.
Fucking demoralizing and degrading.

~~~
vox_mollis
Nightmare? This is the status quo in the majority of small-but-not-tiny
organizations.

~~~
melted
Doesn't make it any less of a nightmare. I wonder what percentage of effort is
wasted in the industry this way.

------
joshu
this seems like one of those stories that got slightly more intricate every
time it was told. and it's been told a lot.

------
Drdrdrq
Internet Explorer. A custom CMS. And sometimes some of the posted data is
missing... on a single POS. And because the data at the end of the form is
optional, the system can't even figure out that something went wrong; it just
silently discards it.

Never figured out what problems that particular IE had, but the workaround was
simple... I added an extra hidden input field just to check whether everything
was sent. The theory was that at least the operator could repost the form if
it went wrong - however, it never did again. Thank you, IE, for your obscure
bugs and nonstandard behaviour! RIP. </rant>

------
cpeterso
My "favorite" bug was debugging a crash in Macromedia's Flash Player for
Windows Pocket PC. It was an ActiveX control running in the Pocket IE browser.
The crash was intermittent and only happened after running a 15-minute
automated test suite on the Pocket PC, crashing deep in Pocket IE code. After
staring at code print outs for a couple weeks, I found the bug: a BSTR
returned from a COM interface was freed using CoTaskMemFree() instead of
SysFreeString(). ARGH! A one-line fix.
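
The bug in miniature (a sketch, not the actual Flash Player code):

    #include <windows.h>
    #include <oleauto.h>

    void demo() {
        BSTR s = SysAllocString(L"hello");
        // CoTaskMemFree(s);   // the bug: wrong allocator for a BSTR,
        //                     // which corrupts the heap -- eventually
        SysFreeString(s);      // the one-line fix
    }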

------
digi_owl
Stories like these make one wonder if "mere mortals" should just turn things
off, scrap them in an industrial chipper, and go back to pen and paper.

------
0xCMP
You know, I know everything posted on the internet is fact /s... but I really
hope this story is true, because it's just crazy.

I mean, who wants to think that the main thing that lets you create code is
the thing that is infected? And it seems like there weren't any other options
for languages (which hopefully weren't compromised as well), so there would be
no real way around it.

~~~
vezzy-fnord
_I mean, who wants to think that the main thing that lets you create code is
the thing that is infected?_

I'd wager that people who work with esoteric embedded toolchains develop an
instinctive distrust in that layer.

------
CM30
That's a pretty messed-up story, though I'm not sure it'd be a coder's worst
nightmare. I'd think that'd be if some major project you worked on got
hacked... while you were enjoying the weekend off and had no access to your
work computer.

Or just if your project developed a show-stopping bug at that point in
general, since you'd get the feeling of 'oh god, we've been owned for two
days' or 'we've been serving malware for two days'.

Or maybe if you'd found a breaking bug that needed fixing, yet no answers were
available online, no one in the room could help, and the site was due to go
live that night. All that time worrying, refreshing your Stack Overflow
question and trying in vain to find a solution would be torture.

~~~
pjc50
The Knight Capital bug is probably most people's worst nightmare:
http://pythonsweetness.tumblr.com/post/64740079543/how-to-lose-172222-a-second-for-45-minutes

------
nomercy400
That's not a nightmare. Somebody else caused it, and you aren't to blame.

It's not the code that causes the nightmare, it's the result of the code. Say,
a live production system, where something went wrong due to your coding,
causing a great irreversible loss. You realize it the moment it happened, and
you can't do anything about it.

Closest I got: data inconsistencies in a 5+ year-old live core production
database of a large company, caused by a coding error. You don't know where it
is wrong, you don't have the means to fix it (it was wrong from the start),
but you do know how much was lost because of it. The only 'good' thing was
that it wasn't my fault.

------
datashovel
"You don't have to pay me anymore, Dr. Phelps, I just want lab time." This is
nerd war.

LOL

~~~
datashovel
This also came to mind while reading:

https://pbs.twimg.com/media/Bp_KeZnCAAAijRe.jpg

------
teddyh
http://www.commitstrip.com/en/2014/10/31/a-coder-nightmare/

------
matobago
Once you start using version control systems, this shouldn't be a problem.

A real nightmare for me was when I inherited a bunch of code full of
uncommented regular expressions.

~~~
greggyb
How does a version control system address a malicious compiler?

Unless you're suggesting the entire system be under VCS?

------
codingdave
About 2 paragraphs in, I was thinking, "Tell me this is not a story about
someone who spent weeks figuring out that his compiler had been compromised."

Ah, well. The legends of self-modifying compilers are old at this point:
http://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf

~~~
soyiuz
Yes, why did s/he not isolate and compile on another machine in step 1?

------
crb002
These days you can just run strace and see all the OS calls.

------
meshko
Unicode and timezones.

------
todd8
The OP is an interesting story. See Ken Thompson's Turing Award lecture[1]
from about that time frame; it is strangely similar.

[1] Reflections on Trusting Trust, Ken Thompson, Aug 1984.
https://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf

------
martin1975
Reverse scheduling. That's the worst one by far.

------
ajarmst
Yet another example of why all programmers should be familiar with
"Reflections on Trusting Trust."
(http://www.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf).
Clearly the malware author was.

------
irascible
This is pretty unbelievable.

~~~
seanwilson
Yeah...I would have compiled the code on a different machine then wiped the
original machine if it had clearly been hacked.

~~~
mytochar
I kept wondering that as well. I'm wondering if this was a different time, or
if it's just a fun story

~~~
mwill
3B2, so probably like early-mid-80's. Likely didn't have many options.

------
facepalm
I've heard another anecdote from university where somebody changed some late
digit of pi in the compiler. Very hard to detect...

------
shams93
Being so sleep-deprived on a new gig that you mistake myvar: for :myvar

------
shabbaa
Coder's worst nightmare? When you are the last line of defense before
something gets pushed into production, and the fear that it will fail IN
production... have I missed something?

------
jimmeyotoole
That's so messed up. I'm so glad I started my development career in an age
where one could spin up a VM in minutes to isolate an issue like this.

------
dijit
forgive me for saying, I'm not a coder, I'm a sysadmin.

I'm scared of having developers with a little ops knowledge take over my job.

/devops/

~~~
ownagefool
You probably should be afraid. Those developers are going to replace your
tools with better tools that they'll know, and you will no longer have years
of arcane experience to fall back on.

Developers spend a lot more time thinking about abstractions, where to place
things, and creating and controlling N things. If you cannot do these things
well, you need to learn them.

The best way to do that will be to ride the bandwagon, work together with devs
and learn from each other.

------
bdotdub
Time zones

------
frik
Software that crashes while you're presenting it to the CEO or on investor day.

~~~
simula67
https://www.youtube.com/watch?v=IW7Rqwwth84

~~~
i336_
How did I just know which video this was before I clicked it...

------
gesman
Bottom line: insist on an hourly rate.

~~~
fishanz
I'm in this camp, but I keep hearing stories about the pot o' gold. "Charge
what it's worth to the client." This happens?!

------
cagenut
being held responsible (on-call) for how it works in production

------
Jupe
COM. 'nuff said.

------
forgottenacc56
Programming.

------
jbombadil
Users with ideas

------
lunulata
For the love of Zeus... it's called git, use it!

~~~
davb
The 3B2 computer mentioned in the story was around in the mid-1980s. Git's
initial release was in 2005.

~~~
iLoch
Not to mention the malicious code was in the compiler, not the source code.

