
Ask HN: Strange bug workarounds? - porjo
Software bugs are a fact of life and, sadly, many never see a (timely) fix. This can lead to some unusual workarounds in order to continue using the software.

What are some unusual/quirky/bizarre workarounds to software bugs that have been encountered by the HN crowd?

A recent one I struck was with the Google Earth desktop app on Linux. It has a tendency to crash on startup unless your mouse is contained within a small rectangle in the middle of the screen [1].

[1] http://askubuntu.com/questions/642027/google-earth-crashes-when-opened#comment1071599_677717
======
mbrock
I worked on health record software. An elusive bug in the custom SQL Server
crypto plugin led to very occasional corrupted entries, which was very bad.

The guy who wrote the crypto plugin had of course quit and nobody knew how it
worked.

Fine-combing the C++, I found an off-by-one error that would cause the
predicted failures: after rebooting SQL Server, the first entry would get
encrypted with a zero key. (Hooray, we could now also fix all the corrupted
data.)

For various reasons it would have been difficult to ship new DLLs to the
affected customers. Only a handful used this particular crypto and it would be
much easier to patch the existing binary DLLs on their servers.

Well... looking at the machine code, I found that the troublesome off-by-one
operations were actually in the printable ASCII range... so I just taught my
friend in tech support to do a particular obscure search and replace in
Notepad++, something like changing ",}" into ",~" in the binary DLL... and
then hot-reload it with an SQL Server command... worked perfectly.
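The patch procedure can be sketched in Python (a reconstruction; the actual byte patterns were specific to that DLL's machine code, and the ones below are placeholders):

```python
# Sketch of the binary search-and-replace patch. The old/new byte
# patterns are hypothetical; the real ones came from reading the DLL's
# disassembly and happened to fall in the printable ASCII range.
from pathlib import Path

def patch_binary(path, old: bytes, new: bytes) -> None:
    assert len(old) == len(new), "patch must not change the file size"
    data = Path(path).read_bytes()
    # Refuse to patch unless the pattern is unambiguous.
    if data.count(old) != 1:
        raise ValueError(f"expected exactly one match, found {data.count(old)}")
    Path(path).write_bytes(data.replace(old, new))

# e.g. patch_binary("crypto_plugin.dll", b",}", b",~")
```

Same-length replacement is what made the Notepad++ trick safe: no offsets in the DLL move, so nothing else needs relocating.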

~~~
obituary_latte
Nice! Must have made that tech feel and look like a hero :)

------
throwaway_yy2Di
Not my workaround:

[http://spectrum.ieee.org/aerospace/space-flight/titan-
callin...](http://spectrum.ieee.org/aerospace/space-flight/titan-calling)

[http://descanso.jpl.nasa.gov/seminars/abstracts/viewgraphs/H...](http://descanso.jpl.nasa.gov/seminars/abstracts/viewgraphs/Huygens_to_DESCANSO.pdf)

This was an extremely serious bug in NASA/ESA's Cassini-Huygens probe, in the
S-band link between Huygens (landing on Saturn's moon Titan) and Cassini
(acting as radio relay).

It was a timing bug. There'd be a very high relative velocity between Cassini
and Huygens, creating a significant (~2e-5) Doppler shift in the link. This
shifted the frequency of the 2 GHz carrier (by 38 kHz). Likewise, it shifted
the symbol rate of the 16 kbps bit stream (by 0.3 bps). The second effect was
overlooked. On the demodulating end (Cassini), the bit-synchronizer expected
the nominal bit rate, not the Doppler-shifted bit rate. Since its bandwidth
was narrower than the 0.3 bps Doppler shift, it was unable to recognize frame
syncs; this was proven in experiments post-launch. The parameter that set the
bitrate was stored in non-modifiable firmware.

As launched, Huygens would have been unable to return any instrument
data. For some context, this was the only probe that's ever visited Titan, at
a cost of about $400 million.

The workaround

[spoiler]

The workaround was a major change in the orbit trajectory of Cassini (a $3
billion probe). Details aside, it set up an orbit geometry with this feature:
at the time Huygens was descending in Titan's atmosphere, Cassini would be
flying at a ~90° angle to their separation. The relative velocity was still
20,000 kph, but _tangential_ velocity doesn't contribute to Doppler shift.
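The numbers above check out back-of-the-envelope (a sketch using only the figures quoted in this comment):

```python
import math

c = 299_792_458.0      # speed of light, m/s
f_carrier = 2.0e9      # S-band carrier, Hz
bitrate = 16_000.0     # bit/s

shift_ratio = 38e3 / f_carrier            # ~1.9e-5, the quoted ~2e-5 Doppler factor
carrier_shift = f_carrier * shift_ratio   # 38 kHz, as stated
symbol_shift = bitrate * shift_ratio      # ~0.3 bit/s, the overlooked effect

# The fix: only the radial velocity component produces Doppler shift.
# With Cassini flying at ~90 degrees to the separation vector, the shift
# vanishes even though the relative speed is still ~20,000 km/h.
v_rel = 20_000 / 3.6                      # m/s
residual_shift = v_rel * math.cos(math.radians(90)) / c * f_carrier  # ~0 Hz
```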

~~~
bbcbasic
That's a truly epic workaround!

------
dlinder
I worked on a social news product and part of our look was to have an icon for
every story - either an image pulled from the page, a user-uploaded image, or,
in the case of Flash content (say, a video player), a screen capture.

We had it all up and running - loading the content, waiting for the player to
initialize, taking the snapshot, generating sizes - on a Windows machine when,
one day, the request came in to migrate that machine to a VM. After the
migration, things were fine - until we disconnected RDP. Snapshots were coming
back at the right size, but totally white.

The eventual "solution" was a laptop in the engineering area RDP'ed into this
VM to keep the snapshots from going white. It got unplugged one holiday
weekend, earning it a red hand-sharpied sign - "PRODUCTION LAPTOP: DO NOT
UNPLUG". It was unplugged again one fateful weekend, this time prompting a
healthcheck to be written that looked for all-white images in its output.
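That healthcheck can be sketched as (my reconstruction; the function names and threshold are invented):

```python
# Flag a snapshot as broken if essentially every pixel is at or near
# full white -- the signature of the RDP-disconnect failure mode.
def is_all_white(pixels, threshold=250):
    """pixels: iterable of (r, g, b) tuples."""
    return all(min(px) >= threshold for px in pixels)

def snapshot_healthy(pixels):
    return not is_all_white(pixels)
```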

That rig ran that way, I believe, until someone had the insight to make a
second VM, this one RDP'ed into the first.

Turtles, all the way down!

~~~
mdip
That's awesome and the solution is not as uncommon as you'd imagine.

At "a large telecom" I used to work at, we had a specific process that handled
billing that relied on a DOS application which was written targeting a
specific modem's hardware. They'd tried to migrate it to something else for
quite some time but the guy who wrote it lived in a different state and was
let go from the company when we closed that site down and moved all of its
equipment to Detroit. It ran on an old Compaq (not HP Compaq, Compaq) desktop
PC, and in 2014 our VP received a frantic call that the drive had failed and
the computer wouldn't boot (from a younger tech who was used to working on
server class hardware). The code for this application had been lost forever
and nobody had any idea how it actually worked but my understanding was that
with it not functional, we were losing enough money to make it a "drop
everything priority".

They brought the machine over to my building and the VP of my department
called me to assist[0]. Sure enough, the system wouldn't even see the drive.
It was at this point that I noticed three numbers with the letters "C", "H",
"S" (cylinder, head, sector) next to each. This had happened before,
apparently, and someone discovered
the BIOS battery had died. Thankfully, they were kind enough to put the drive
parameters on a label for me. I popped into the BIOS, put 'em in and it
booted. The computer remained powered on in the cubicle I repaired it in (just
outside said VP's office) for a year until the dev team got around to
modernizing the code.

[0] I was not a support person at this time but was in the past and it wasn't
unusual for them to call me in on strange problems. I was also known for
having recovered a hard drive with important data on it using the break-room
fridge (though I'm not sure this VP was aware of that).

~~~
dlinder
You sound like a kindred spirit. I have put hard drives in freezers to release
stiction; I have baked motherboards in the oven to re-flow questionable
solder. I wonder if anything in our kitchen is sacred! Sometimes I wish I had
"MacGyvering goofy tech junk" as a full time job!

~~~
mdip
No doubt! Yup, I've done the oven thing, too (several PS3 motherboards as well
-- used to buy 'em broken on Craigslist when there was a chance they'd be
running older firmware and resell them).

Trick with the freezer hard drive: if you ever order perishable items over the
internet, they sometimes ship in boxes with large bags of "blue goo". Pop
those in the fridge and the next time you need to keep a drive spinning long
enough to get one last copy out of it, sandwich it between two of those. They
don't get cold enough to pick up condensation and short the drive and the blue
goo keeps cool for a _long_ time if the bags are large enough.

------
frereubu
Not so much a software bug, but back in my early days (late 1990s) supporting
an office network in London there was a computer where the mouse was making
the cursor behave erratically during roughly the same period every afternoon.
We swapped out the mouse, the controller card, even the computer - effectively
replacing all the physical equipment - and nothing seemed to stop it. We went
through all sorts of ideas - too near the microwave, heavy fax machine usage,
someone's mobile phone - until we realised that it was an optical mouse, and the
sun would shine through that window each afternoon at the same time and screw
up the sensor in the mouse. We stuck a bit of cardboard to the side of the
desk and it never happened again.

~~~
nom
Haha awesome. I was once fooled by the sun, too. I noticed unusually high
power consumption readings of several kWh in my logs. They always appeared at
the same time, almost down to the minute.

So it turns out there is a very small time slot where the sun can reach
through a window into the hallway. That was enough to offset the light sensor
that I attached to the power meter inside the closet. The threshold was set
too tight.

Think about the possible sources that influence this 'bug':

- the month
- the time of day
- the weather / state of the clouds
- the open/closed state of the bathroom door
- the reflectivity of the hallway (objects, doors open/closed)
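A common fix for this class of bug is hysteresis: two thresholds instead of one, so a marginal reading (a sliver of sunlight) can't flip the state. A sketch, not nom's actual setup:

```python
def make_light_detector(on_level, off_level):
    """Return a step function with hysteresis: the detector only turns
    on above on_level and only turns off below off_level."""
    assert on_level > off_level
    state = {"on": False}
    def step(reading):
        if not state["on"] and reading >= on_level:
            state["on"] = True
        elif state["on"] and reading <= off_level:
            state["on"] = False
        return state["on"]
    return step
```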

~~~
heywire
Towards the end of summer, one of my Raspberry Pi security cameras starts
detecting "motion" in the form of the sunlight dancing on the wall when the
fluffy clouds float by :)

~~~
justinpombrio
Alert! Either break-in, or fluffy cloud!

~~~
throwanem
Alert! What a lovely afternoon!

------
mjg59
Samsung laptops would fail to boot if the UEFI variable store was 100% full.
The original solution to this in Linux was to leave at least 5K of free space.
However, on several systems, removing UEFI variables didn't actually free up
space - it was marked as free internally, but the reported amount of free
space didn't increase, and so Linux would refuse to allow you to create new
variables. The "solution" was to attempt to create a variable _larger_ than
the available free space, which forced the firmware to trigger a garbage
collection run and re-synchronise the internal and external views of the
amount of available free space. Doing something that we knew would fail was a
requirement for avoiding killing laptops.
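The behaviour can be modeled with a toy variable store (a simulation of the described firmware bug, not real efivars code):

```python
class ToyVarStore:
    """Deleting a variable marks its space dead without reclaiming it;
    only an (otherwise doomed) oversized write triggers garbage
    collection -- the workaround described above."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.live = {}   # name -> size in bytes
        self.dead = 0    # space freed internally but still reported used

    def reported_free(self):
        return self.capacity - sum(self.live.values()) - self.dead

    def delete(self, name):
        self.dead += self.live.pop(name)

    def create(self, name, size):
        if size > self.reported_free():
            self.dead = 0            # the failed attempt forces a GC run
            if size > self.reported_free():
                raise MemoryError("variable store full")
        self.live[name] = size
```

Deleting variables leaves `reported_free()` stuck until a deliberately oversized `create()` fails and, as a side effect, resynchronises the store.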

~~~
yuhong
Interestingly, there has recently been
[https://github.com/Microsoft/BashOnWindows/issues/976](https://github.com/Microsoft/BashOnWindows/issues/976)

------
romanhn
Many years back, I was working on a web application that, among other things,
could generate PDF user reports. These reports were generated from HTML web
pages using a third-party library. Normally this worked well (as well as such
a tool could be expected to work, anyway); however, once a month or so the
fonts on the reports would come out super tiny. This would then happen in
random reports until we rebooted all of the app servers. The bug occurred in
production only, never in our dev, staging or QA environments.

Many hours of investigations were committed, many emails to the vendor were
written, much hair was torn out. No luck whatsoever. Months passed, and the
bug reoccurred at random intervals and did not consistently affect all
reports. One day I logged in remotely to one of the Windows app boxes as an
admin/console user and was annoyed to once again discover that it forced my
screen resolution to change. That's when I had an epiphany and 10 minutes
later was able to reproduce the bug in my local environment.

Turns out the third-party library had some funky rasterization logic that took
into account both the resolution of the machine when the library/service was
started as well as the current resolution, pretty much expecting both to be
the same. Logging in remotely as a console user has the behavior of taking on
the resolution of my local machine, which was always higher than what the
remote box ran at. Another thing to note is that the console user logged into
the same running instance of Windows that was generating the PDFs. BAM! The
cached value used by the library no longer matched the runtime resolution and
the reports now generated screwy tiny fonts. This happened rarely because
logging in as admin/console was not the recommended approach, and it was
inconsistent because we had multiple app boxes and the other ones continued to
work OK.

Solution - disallow admin/console remote logins. This was one of the most
obscure bugs I have had the pleasure of solving.
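A toy model of the mechanism (my guess at the library's internals, based only on the symptoms described):

```python
# The library caches the display resolution at service start and scales
# fonts by cached/current; a console login that bumps the resolution
# makes every subsequent report render with shrunken fonts.
class PdfRenderer:
    def __init__(self, display_height):
        self.cached_height = display_height   # captured once, at startup

    def font_px(self, base_px, current_height):
        return base_px * self.cached_height / current_height
```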

------
kogir
The Motorola iDEN [1] series of phones were pretty sweet back in their day and
had a JVM you could actually write and deploy apps on.

I worked on Loopt, an early mobile location sharing app, and we talked to our
server over HTTPS. Things were working great on a few LG and Sanyo phones, and
worked fine in the iDEN emulator, but POSTs would fail consistently on the
device itself. GETs worked fine.

After watching traffic on the server for a bit, I noticed the POST requests
all advertised HTTP/1.1 and sent the Expect: 100-Continue header. On a whim I
configured the server to treat all incoming connections as HTTP/1.0 so it
would never send the 100 (Continue) response [2].

It worked!

Or did it? Turns out the iDEN phones were now happy, but the other phones were
not and would refuse to send POST bodies if they didn't receive the 100
(Continue).

This well and truly sucked, and we thought for a bit we'd need to have two
different endpoints with different configurations to support the differently
incompatible phones. Lame.

But then I remembered the format of an HTTP request:

    
    
        POST /path HTTP/1.1\r\n
        Expect: 100-Continue\r\n
        [Header: Value]\r\n
        \r\n
        [Body]
    

What if I supplied a malformed URL? Something like "/path HTTP/1.0\r\nX-iDEN-
Ignore:"? Then, if there's no validation or encoding, the request will look
like this:

    
    
        POST /path HTTP/1.0\r\n
        X-iDEN-Ignore: HTTP/1.1\r\n
        Expect: 100-Continue\r\n
        [Header: Value]\r\n
        \r\n
        [Body]
    

Turns out that worked. The JVM was never updated or fixed, the hack shipped,
and it worked consistently for the lifetime of those phones.
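The splice is easy to reconstruct (a sketch; the original J2ME client is long gone, but the stack evidently interpolated the supplied "URL" straight into the request line):

```python
def request_head(path):
    # What the buggy HTTP stack effectively did with the caller's path.
    return f"POST {path} HTTP/1.1\r\nExpect: 100-Continue\r\n"

crafted = "/path HTTP/1.0\r\nX-iDEN-Ignore:"
lines = request_head(crafted).split("\r\n")
# lines[0]: "POST /path HTTP/1.0"      <- server sees HTTP/1.0, sends no 100
# lines[1]: "X-iDEN-Ignore: HTTP/1.1"  <- real version swallowed into a dummy header
# lines[2]: "Expect: 100-Continue"     <- harmless on an HTTP/1.0 request
```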

[1] [https://en.wikipedia.org/wiki/IDEN](https://en.wikipedia.org/wiki/IDEN)

[2] "An origin server ... MUST NOT send a 100 (Continue) response if such a
request comes from an HTTP/1.0 (or earlier) client"
[https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8....](https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.2.3)

~~~
nhf
I remember iDEN phones! I had a Motorola i355 for a while. Software sucked but
the thing was an absolute tank. Plus, it was one of the rare "dumbphones" of
the time to have integrated GPS, so I remember using it as a GPS tracker for a
while afterwards.

------
josu
My favorite:

>Wing Commander was originally titled Squadron and later renamed Wingleader.
As development for Wing Commander came to a close, the EMM386 memory manager
the game used would give an exception when the user exited the game. It would
print out a message similar to "EMM386 Memory manager error..." with
additional information. The team could not isolate and fix the error and they
needed to ship it as soon as possible.

>As a work-around, one of the game's programmers, Ken Demarest III, hex-edited
the memory manager so it displayed a different message. Instead of the error
message, it printed "Thank you for playing Wing Commander." However, due to a
different bug the game went through another revision and the bug was fixed,
meaning this hack did not ship with the final release.

[https://en.wikipedia.org/wiki/Wing_Commander_(video_game)#De...](https://en.wikipedia.org/wiki/Wing_Commander_\(video_game\)#Development)

------
tominous
I worked on an HSM system (hybrid disk/tape archival) which suddenly started
having lots of I/O errors writing to tape. We tried new media. We tried new
drives. We double-checked cables and SFPs. No luck.

Finally we tracked down the issue: when the _contents_ of a particular file
were archived to tape, the tape drive crashed. I suspect it was a tape
firmware issue, maybe to do with the native compression.

The workaround was to mark that particular file as "not to be archived" and we
stopped having media and drive errors.

~~~
hga
Ah, yes. The "best" (as in fastest by far, with the highest-quality results)
OCR software back in the '90s was owned by a company that was rumored to have
pissed off its core technical team, who left and only occasionally deigned
to do consulting for them.

So they had a wonderful core wrapped in baroque APIs, but the real problem was
that the core wasn't entirely wonderful: occasionally, when you presented it
with a "Death TIFF", as we called such images, it would reliably crash. This
was true of both the software and firmware versions of the code (they had a
hardware-accelerated box with one or more Intel RISC chips), and on the PC
platform at least, e.g. Windows 3.x using a DOS box, it would entirely lock
up the machine.

To get around this for a client that had 500,000 images to OCR on a tight
deadline for a legal case (and this was the golden era of legal document
imaging, back then lawyers would pay 50 cents per OCRed page, because a full
text search could e.g. impeach a witness on the stand in real time), I created
a system where the PC would _always_ be printing out asterisks if it was
OCRing pages. That allowed an operator to tour the machines and easily see
when he had to manually reboot one stuck on a Death TIFF, after which my
software would recognize what had happened and continue with the next image.

~~~
foone
I bet I worked with the same company when I was with the government. We had a
subcontractor who'd been hired to digitize something like 200 million paper
records (they made it about 50 million in before we ran out of funding). But a
small fraction of the TIFF files they generated wouldn't work with any of the
tools we had on hand.

It turned out that Windows 98 shipped with an Imaging program (Licensed by MS,
not written by them) which predated the standardization of the JPEG-in-TIFF
subformat, but they'd basically guessed at how it would work and shipped that.
The final spec (and the version of JPEG-in-TIFF nearly everyone else
implemented) ended up being different. So basically nothing could read it.

We ended up calling them up every time a customer found one of these files and
having them print out that image on one of their windows 98 machines, and scan
the printout back in using one of the newer machines. Sure, we lost some
quality, but at least the customers could access the data now.

For a time reference, these broken images were still showing up in newly
scanned documents in 2011 (when we stopped working with them due to massive
fraud), so they must have been using their Win98 scanner systems even then.

~~~
hga
No. To the best we could determine (and we had a guy who liked to get into the
weeds of CCITT Group 3 and 4 compression), it was the raw images themselves,
and there was nothing wrong with them; some just tickled a bug. If I remember
correctly, their API required stripping off the header and presenting the OCR
code with some metadata and the compressed image. It's been way too long for
me to remember the details, except that it was fairly obnoxious to interface
to, I couldn't just hand it a TIFF in some way (helped us VARs really "add
value" and earn our keep :-).

We were producing our own TIFF files using our own software that drove monster
Kodak ImageLink scanners (software I in fact took over, redid the SCSI driver
of, and eventually did a clean rewrite of the engine on Sun workstations), so
the images and their compression came straight from Kodak, and going further,
I don't recall those 600 pound beasts ever screwing up at that level.

And this was _way_ before Windows 98, it was Windows 3.0 or by then 3.1, like
in 1992, Windows was utterly naive about document image files. Which I can see
was a blessing (although maybe it was losing quality, I'd long switched to NT
by the time 98 came out).

~~~
foone
We also had weird CCITT Group 4 issues, because of someone trying to be extra
smart and convert TIFF to PDF without a recompress (PDF supports Group 4
compression too, so you can turn a Group4 TIFF into a Group4 PDF by just
swapping the header!)

I didn't mean it was definitely the same company, just a similarly annoying
TIFF issue.

------
cperciva
Grepping my checked-out source trees quickly:

1\. spiped re-binds SIGINT if it is launched as pid 1, in order to work around
a Docker bug:
[https://github.com/Tarsnap/spiped/blob/master/spiped/main.c#...](https://github.com/Tarsnap/spiped/blob/master/spiped/main.c#L268)

2\. In my POSIX-violation-workarounds script, ironically enough, I work around
a bug in bash which makes 'command -p sh' run with the incorrect path (this
has since been fixed, but continues to be present in older installed versions
of bash):
[https://github.com/Tarsnap/spiped/commit/e3968941c9c1b20c63d...](https://github.com/Tarsnap/spiped/commit/e3968941c9c1b20c63d578f1d262d97d5781d240)

3\. In my getopt code, I use a (non-C99-compliant) computed goto in order to
work around a bug in LLVM's handling of sigsetjmp/siglongjmp:
[https://github.com/Tarsnap/libcperciva/commit/92e666e59503de...](https://github.com/Tarsnap/libcperciva/commit/92e666e59503de37b958c7e70e56ae71b281b036)

4\. Many years ago, I added a spurious 'volatile' into some Tarsnap code in
order to prevent a buggy LLVM optimization step from running (it was making
the Tarsnap build hang on OS X 10.7):
[https://github.com/Tarsnap/tarsnap/blob/master/tar/multitape...](https://github.com/Tarsnap/tarsnap/blob/master/tar/multitape/chunkify.c#L70)

------
nathanielc
Several years ago (I don't remember much of the specifics) we had an issue
with static content served from our site being randomly truncated
(polluting the cache etc).

We eventually traced the issue down to the Nginx server that was serving the
files and one of its cache buffer size config options (I don't remember
which one anymore). We noticed if the file being served was larger than a
certain size it would occasionally truncate the file but not always. We tested
increasing the buffer size by repeatedly doubling the default value, which was
a power of two, up to a size of several GBs. But the files kept being
truncated for some small percentage of the requests. At this point we knew it
wasn't directly related to the size of the buffer since it was larger than any
files being served. Finally someone suggested that we test a value that wasn't
a power of two and the issue was gone.

We figured it was an internal bug in Nginx where it was growing an allocation
buffer in powers of two, but had an off-by-one error that didn't copy
the second half of the buffer or something. We dug through the code but never
found anything and so we left the cache setting at +1 from the default power
of two value and never had an issue again.

------
MaulingMonkey
Wireshark let me find out that Unity's WWW class ignored request HTTP headers
on iOS, causing our usage of S3 to fail. I worked around the problem by
switching to URI based authentication.

On-screen keyboards displayed Chinese after visiting a system menu. We freed
the async operation when the system menu "canceled" the keyboard operation (it
wasn't supposed to be even displaying), but apparently the system had a use-
after-free bug. I worked around the problem by switching to a 4 entry LRU
allocator, keeping the past 3 or 4 canceled operations around untouched (1
would've probably sufficed, but I'm paranoid.)

A WinRT API to check internet connectivity would exit(3) our app without error
messages or related callstacks - but only if the Charm bar was open for more
than 10 seconds, assuming you called it once per frame on the main thread. I
had to bisect our history to figure that one out - and repro in a new test app
to confirm it was the real cause.

EDIT: Third party injected DLLs crashed our app at least twice - once for some
monitoring software on a coworker's computer (crashed when closing file
handles as build tools tried to clean up and exit), once for an old Microsoft
Word IME that predated the Win8 app sandbox whose restrictions it was
violating. The monitoring software was uninstalled, the IME I couldn't think
of a reasonable workaround for and left to Microsoft to fix.

------
beamatronic
I used to have a Commodore 64. I had one _specific_ game that would not load
successfully unless my monitor ( a TV actually ) was turned off. So I had to
type "LOAD *,8,1" or whatever, then turn off the monitor, then press RETURN.
I'd turn the monitor back on after the disk drive lights went off.

~~~
abricot
In 1999 I had an old (at the time) Pentium 133 that wouldn't let me reinstall
Windows while the network card was plugged in. If I did, the mouse, the
graphics card, the network card, and the secondary hard drive wouldn't work.

If I unplugged the network card when I installed, there were no issues.

------
williamjackson
There was a security camera with a built-in HTTP server at a previous job. The
built-in server would respond without a problem when viewed from one computer,
but would force close the connection without a response when viewed from
another computer.

I used Fiddler to compare the requests from the two computers and eventually
discovered that the request would fail if the `Accept` header was longer than
some value (might have been 255 characters -- I don't remember).

Turns out when you install Microsoft Visio and Project, Internet Explorer's
Accept header gets really long.
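The Fiddler comparison amounts to a search over header length, which can be sketched as a binary probe (the 255-byte limit below is illustrative, and the lambda stands in for sending a real request to the camera):

```python
def max_working_length(request_succeeds, lo=0, hi=4096):
    """Largest Accept-header length for which the camera still answers."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if request_succeeds(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

camera = lambda n: n <= 255   # stand-in for a real HTTP probe
```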

------
Dr_Jefyll
In the 1980's I had a client that manufactured cheques, and the typesetting
was done by four or five "Wescode 1420M" systems. These technological marvels
used an 8" floppy drive to input order data -- the customer name, account
number and so on. The output was rendered onto a single web of fan-fold
material which successively threaded its way through _two_ Diablo daisy-wheel
printers. The key point is it was a pipeline, with multiple orders in flight
simultaneously.

Floppy swapping was a normal part of the work flow, and there was an obscure
vulnerability in this regard. In some circumstances if the disk was changed at
an incorrect time it was possible for data to leak between orders. (For
example, Ted's cheques might bear Alice's account number! To call this
intolerable is putting it mildly.) Disk-swap prompts were displayed on a
terminal for the operator's benefit, but the environment was hectic and humans
are fallible.

Did I alter the software so it'd preview the data and verify that every disk
change occurred as prompted? No. The 1420M computer featured _three_ 8080
microprocessors mucking around in shared memory, and the code was a spaghetti
monolith written in assembly language. I've reverse-engineered lots of stuff
before -- there are a coupla stories here [1][2] -- but some challenges you
need to walk away from. The time frame would've been open-ended, and that
wasn't acceptable.

What I did was supply the client with a gory hack. No apologies -- it was the
best way to serve their needs! On each 1420M I installed an 8741
microcontroller that monitored program status by eavesdropping on the RS232
line that carried text strings to the terminal. If those messages failed to
agree with observed disk-change activity (relayed by the Door_Open signal on
the floppy drive) the microcontroller would yank the 1420M Reset line low.
This would crash the pipeline and force the operator to reboot -- a
considerable nuisance... and yet, enormously preferable to allowing the error
to go undetected!

[1]
[http://laughtonelectronics.com/Service/Embedded%20Computer/e...](http://laughtonelectronics.com/Service/Embedded%20Computer/embedded%20computer.html)
[2]
[http://laughtonelectronics.com/Projects/uCtlr%20Interfacing/...](http://laughtonelectronics.com/Projects/uCtlr%20Interfacing/uCtlr%20Interfacing.html)

~~~
Senji
"Human in loop" swapping floppies. Byzantine.

~~~
Dr_Jefyll
Yup. And the triple 8080's contributed a lot to the character of the thing,
too.

------
nevir
Found deep in the guts of some shared library at Amazon (many years ago;
probably still exists):

    
    
        #define private public;
        #include "something";
        #define private private;
    

(Not to fix a bug, but certainly a hacktastic workaround)

~~~
imron
Looks like it will produce a compilation error to me ;-)

~~~
stevekemp
Yeah the trailing `;` would likely cause problems.

------
skissane
Many years ago, one place I worked at had the following setup: a closed source
application would generate a CSV file, which was then FTPed to another server,
where a Perl script translated from CSV to fixed-column-width format (which
happened to be identical to the output format of an old mainframe application
that we'd migrated off), and then the fixed-column-width file was FTPed to yet
another server which loaded it into a database. Now, the CSV file had a number
of fields - name, address, etc; but it also had an encrypted password field.
We didn't use the encrypted password for anything, we didn't even know what
format it was in (hashed or reversibly encrypted or so on). The CSV format was
fixed by the vendor and we couldn't change it. However, rather than being
output in Hex or Base64 or similar, the closed source app just put the binary
data of the encrypted password into the CSV file, which would randomly contain
comma or new line characters. The author of the Perl script wasn't aware of
this possibility, so the Perl script would die, complaining it had got an
invalid input line (wrong number of fields), whenever that randomly happened
(sometimes several days in a row, other times it could go weeks without
happening).
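The failure mode is easy to demonstrate (a reconstruction; field names and counts invented):

```python
# Raw encrypted bytes dumped into a CSV field can themselves contain
# the delimiter or a newline, so a naive split sees the wrong number
# of fields -- exactly what killed the Perl script.
EXPECTED_FIELDS = 4   # e.g. name, address, id, encrypted password

def naive_parse(line):
    fields = line.split(",")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"invalid input line: {len(fields)} fields")
    return fields

good = "alice,12 Foo St,1001,b64ish-garbage"
bad  = "bob,9 Bar Rd,1002,garb,age"   # password bytes happened to include 0x2C (',')
```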

I proposed to modify the Perl script to fix this issue. However, management
refused to let anyone modify the Perl script. The guy who wrote it was a
contractor who had moved on years ago. This Rube Goldberg file conversion and
transfer formed part of a critical business process. A couple of years earlier
it had failed, and its failure resulted in bad press and reputational damage.
So they were way too scared to let anyone modify the code of the Perl script.

Instead, what happened was that each day a person would manually check whether
the script had run successfully the previous night. If it hadn't, they'd fix
up the data issue in the input CSV file using a text editor and then manually
start the Perl script again. Management agreed that we could automate that checking
process, and if the Perl script failed they would get an alert on our service
availability dashboard. But no way would they let anyone fix the bug in the
Perl script.

~~~
Senji
I can deduce this place was a bank.

------
cesarb
At a previous company, we had a legacy application written in PowerBuilder, which
crashed on some of the client's computers. We couldn't reproduce the crash on
our computers, no matter how much we tried.

We finally got access to one of the crashing laptops, and (with the client's
permission) installed a debugger on it. After a few false starts, we found
that some code deep within PowerBuilder's framework crashed when it received a
particular accessibility window message, and that this window message was
being sent by some Microsoft touch screen component. All of us techies had
avoided buying touch screen laptops (this was when touch screen laptops were
Microsoft's latest fad), which is why it had never happened on any of our
machines.

The solution was to do a binary edit of the import table of the relevant
PowerBuilder DLL to route all Windows calls to a helper DLL, which forwarded
them to the real Windows DLL after replacing the window message callback with
a small thunk. Said thunk then filtered out the offending window messages,
before forwarding the rest back to the real window message callback within the
PowerBuilder DLL. Hacky, but worked perfectly.
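Conceptually, the thunk did something like this (Python standing in for the real machine-code shim; the message id is illustrative):

```python
WM_OFFENDING = 0x003D   # placeholder for the accessibility message involved

def make_filtering_wndproc(real_wndproc, blocked=frozenset({WM_OFFENDING})):
    """Wrap the framework's window procedure, dropping the messages that
    crash it and forwarding everything else untouched."""
    def thunk(hwnd, msg, wparam, lparam):
        if msg in blocked:
            return 0   # swallow the offending message
        return real_wndproc(hwnd, msg, wparam, lparam)
    return thunk
```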

------
rincebrain
A recent encounter was a fix for playing the original BioShock under Windows 7
- the sound would stop working after the first intro. The trivial fix is to
plug something into the microphone port. [1]

Another good one - back around the era of the original NVIDIA Ion boards, I
was helping to run a cluster of these boards as an experiment in low-power
computing. [2]

Some ran Linux, some ran Windows. Running CUDA code under Linux headless is
fine, running it under Windows with a non-Tesla GPU was nontrivial at best
(and involved hacking up the Tesla variant of the driver to add some PCI IDs).
Unfortunately, it turns out that this breaks if you don't have an actual
display attached to the machine.

The solution that was implemented was to take 36 naked male VGA headers and
solder resistors across just enough pins to convince the system that there was
a display there, and then install them.

Or the Samsung SMART IDENTIFY hard drive bug - which meant that the advice
"disable SMART to keep your data safe" was sometimes valid. (The drives had a
FW bug that caused them to drop data in the write cache if they got a SMART
command before flushing it.) [3]

I'm sure I'll think of more later.

[1] -
[http://forums.steampowered.com/forums/showthread.php?t=10931...](http://forums.steampowered.com/forums/showthread.php?t=1093192)

[2] -
[http://www.nvidia.com/content/gtc/documents/sc09_szalay.pdf](http://www.nvidia.com/content/gtc/documents/sc09_szalay.pdf)

[3] -
[https://www.smartmontools.org/wiki/SamsungF4EGBadBlocks](https://www.smartmontools.org/wiki/SamsungF4EGBadBlocks)

~~~
bluesilver07
The 'plug-in microphone' fix was common to other games like Call of Duty as
well
([http://forums.steampowered.com/forums/showthread.php?t=21964...](http://forums.steampowered.com/forums/showthread.php?t=2196456)).
From one of the Steam forum entries - "The reason plugging in a microphone
works is because 'Stereo Mix' is automatically turned on when you plug in a
mic."

~~~
Senji
This would probably not work on some shitty audio cards with shitty drivers
that have explicitly removed "stereo mix" for "copyright purposes".

~~~
rincebrain
Would it? I'm not entirely sure what the origin of the technology involved is,
but as they said, it implicitly enables it even on drivers that don't have it
as an explicit option (like the audio driver stack on my Win7 box, at the
moment).

------
web007
The garret: glen; CSS bug, circa 2006 or 2007.

Starting out with a bunch of existing CSS, a developer added 2 new properties
somewhere in the middle, but forgot a trailing semi between them. Reloading
the page showed the first change, but not the second. He tried ten different
variants of the second property name, spelling, values, etc. and nothing was
showing up. He added another property before the broken one to help debug, and
it started working. He then tried several variants on that to see if it was
some arcane ordering bug, and eventually ruled that out by using two
developers' names for property:value.

Because all of the intermediary versions included a semi, and because the
first property allowed some kind of extended content that was ignored, it took
half a dozen developers looking at the "weird bug" before someone noticed the
missing semi on the first property.

------
djeebus
A registration company I used to work at was using .NET 1.1. Being the super
ambitious junior developer I was, my first move was to upgrade our software to
the latest and greatest: .NET 2.0. After it passed all the tests and was signed
off by QA, we moved it to production and patted ourselves on the back, having
done A Good Thing (tm).

Soon afterwards, however, we started receiving reports of our users not being
able to refund or charge credit cards. All that information should have been
in the DB, encrypted! We quickly discovered that, on occasion, the encrypted
data was getting corrupted. We immediately did what every engineer would do in
our place: blame the previous engineer's code, then try to find the bug that
would prove our theory right. After days of studying source code and testing
theories, nothing explained the occasional corruption.

Eventually we traced the beginning of our problems back to our
server/framework upgrade, and found a Backwards Incompatible Change: invalid
unicode code points would now be silently dropped, rather than being allowed.
It turns out that all of our credit card numbers were being encrypted
properly, but then DECODED using the UTF-8 Encoding and stored in an NVARCHAR
column in the DB! Everything was fine in .NET 1.1 (and SQL Server 2000) but
.NET 2.0 silently drops the invalid UTF-8 code points. With those code points
missing, it was impossible to decrypt the data and do anything with it.

... I suppose that makes it more secure though, so there's that ...

We felt that .NET 2.0 was a big enough upgrade that it was worth adding some
new warts to our system. The final hack: we found an unused PC and built a
.NET 1.1 web service with two functions: encrypt and decrypt. We stored credit
card numbers in the database in plain text, made a call to this web service
with the row id, and it encrypted the data. This solution lasted almost 5
years before our boss accepted the pain of an hour of downtime and we
exported/decrypted/encrypted/imported the entire DB.
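The corruption mechanism is easy to demonstrate in any language. Here is a rough Python analogue of what the lenient decode did to the ciphertext (an analogy, not the actual .NET code):

```python
# Arbitrary "ciphertext" bytes; 0x80 and 0xFF are not valid UTF-8 on their own.
ciphertext = bytes([0x41, 0x80, 0xFF, 0x42])

# A lenient text decode silently drops the invalid sequences...
as_text = ciphertext.decode("utf-8", errors="ignore")

# ...so round-tripping through a text column loses bytes, and the
# "ciphertext" that comes back can no longer be decrypted.
assert as_text.encode("utf-8") == b"AB"
assert as_text.encode("utf-8") != ciphertext
```

.NET 1.1 behaved like a pass-through here, which is why the bug only surfaced after the upgrade.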

------
ivank
I've got Intel graphics and a 4K monitor on Linux. With the Intel drivers, I
have no vsync (I can't use TearFree because of strange video corruption
issues), but things mostly run correctly. With modesetting drivers, I have
triangular tearing and serious performance issues in Sublime Text, but _do_
have vsync in fullscreen.

My workaround for watching movies with vsync? Use Intel drivers in my main X
session, modesetting in a secondary X session just for mpv.

~~~
bertiewhykovich
Ah, the joys of Linux. Truly the world's greatest operating system.

------
aruggirello
My "favourite" bug workaround is for the KDE Plasma 5 desktop wallpaper
changer, which degrades the pictures being used (blurring them, almost ruining
them) whenever it downscales them (when they are larger than the desktop's
native resolution), something lots and lots of KDE users are complaining
about. There is no fix released yet but, being a creative user, I resorted to
installing "variety", a very cool desktop wallpaper changer (and downloader).

As Variety can apply ImageMagick filters on the fly to the wallpaper being
set, I set it up so that it just scaled down and cropped the image to my exact
desktop resolution. This fixed the issue for me... at least, temporarily :)

To set up the filter, I edited the ~/.config/variety/variety.conf, and changed
the line:

    
    
      filter1 = ...
    

to

    
    
      filter1 = True|Keep original|-scale '<my desktop resolution, eg. 1920x1080>^' -gravity center -extent <my desktop resolution, eg. 1920x1080>
    

Then I configured Variety to generate a single wallpaper file in a folder
which is "watched" by the KDE Plasma desktop wallpaper changer, with the same
interval. Voilà!

------
lscharen
Not a "real" problem on a running system, but back in my first year of
undergrad I had a computer science assignment that kept faulting with an
"Illegal instruction" error on our Solaris systems.

I had a C compiler on my personal computer and the same program ran and
compiled fine there, but we had to submit our solution in source code form on
the Department's shared system to plug into the class' automated build and
test scripts.

Eventually, I discovered that adding an extra space to a comment fixed the
error. I wasn't experienced enough at the time to know how to use GDB to
disassemble and debug binaries, but, looking back, I think I must have
triggered a compiler bug that misaligned an instruction (SPARCs were 4-byte
aligned, IIRC) and adding the extra space somehow fixed the alignment of the
generated code.

~~~
grkvlt
Sadly, I don't think that's true. A first-year CS undergraduate would not be
writing code that triggered a compiler bug; the real problem was most likely
your code.

I suspect you had an error in your program, an off-by-one or other type of
overflow, that caused the stack to be executed. Compiling without debug would
mean that the code executed was harmless, compiling _with_ debugging symbols
(the -g option in gcc) enabled caused a different memory layout, which
triggered an attempt to execute data that contained an illegal instruction.
Since in debug mode comments are included in the data segment, adding a space
to a comment further changed the memory layout making the error innocuous
again.

// _EDIT_ After thinking about this a bit more, I'm not entirely convinced by
my explanation since comments aren't included in the debug symbols. However, I
still think it's _more likely_ that a debug (versus optimized) build had
different memory layout, and therefore different behaviour in the presence of
a stack/heap smashing bug...?

------
willvarfar
Brings to mind this absolutely classic old story:

[http://thedailywtf.com/articles/ITAPPMONROBOT](http://thedailywtf.com/articles/ITAPPMONROBOT)

And pics of a build it inspired:

[http://thedailywtf.com/articles/The-Son-of-
ITAPPMONROBOT](http://thedailywtf.com/articles/The-Son-of-ITAPPMONROBOT)

~~~
kjetijor
A friend of mine had a similar thing, where a desktop-box-turned-server
essentially locked up after just over 24h of uptime. Solution: an outlet timer
that cycled power around 2am when nobody was looking.

Similarly, there were some minor issues with the cooling for my compute
cluster at my previous job: it wasn't really designed to function in climates
where the temperature varied too much. Notably, it would turn off the
compressors on hot summer days and cold winter days. While waiting for the
techs, tiny rocks found on the roof were used in conjunction with some tape to
force the mechanical relays on.

[http://www.pvv.ntnu.no/~kjetijor/images/tape_rocks.jpg](http://www.pvv.ntnu.no/~kjetijor/images/tape_rocks.jpg)

~~~
ajford
A few years back, I was part of a group in the early days of commissioning a
piece of research equipment that consisted of many racks of FPGA and GPU
computing equipment in a specially modified shipping container. This thing was
installed in a desert area, and had to be cooled by a couple of AC units.

The issue was similar. On nights where the temperature dropped too close to
the dew point for too long, the units would freeze over. However, at the time,
there wasn't any temperature monitoring. So someone figured out how to monitor
the die temp on the FPGAs without changing the running code. Took them a few
days. By the time they finished, someone realized they could tie streamers to
the AC vent, which could be seen in the remote video stream.

Anyways, the fix was to connect to the network, switch the AC unit to fan only
for a couple of hours, then switch them back on. If I remember correctly, it
was like this for about 6-8 months before they finally had someone replace the
AC system with a more commercial unit that could handle the condensation.

------
sethammons
Not mine, but a classic. Emails that can only be sent 500 miles:
[http://www.ibiblio.org/harris/500milemail.html](http://www.ibiblio.org/harris/500milemail.html)

------
existencebox
Christ, I should be keeping a list over the course of my career, I'm sure I've
forgotten some gems.

Some that stand out: We had a NoSQL-esque backend that stored CSVs as part of
a data pipeline (CSV in, data "Activity", CSV out). You specified the file,
whether it had headers, the separator, etc. As it turned out, you could not
define a null separator if you wanted a single-column file. I needed something
that would properly split what I knew to be well-formed, all-alphanumeric
inputs within the valid ASCII range, and would avoid spurious splits. The sep
I used was naturally the snowman unicode character (unicodesnowmanforyou.com,
which as it turns out HN sanitizes on posting!). The punchline comes when I
started seeing this pattern show up in production code elsewhere in the
company, using this exact same character choice. Snowman-separated files++
(.ssv?)

Another fun bug came from working on a very large platform that had a common
telemetry library built on perf counters. The original authors, and all of the
platform authors consuming the lib, had gone on their merry way without
realizing that perf counter instances have a disallowed character set, which
the custom lib was _embedding by default_ when it added metadata to the
instance name (#foo or something IIRC). Fixing the metadata appending was easy
enough, but fixing every place where the consumers had named something with an
invalid char (and then consumed it with the same invalid char on the read
side) wasn't feasible, so I ended up writing a shim that sat between the perf
counter lib and the world and silently replaced the invalid chars with
something strange like _<charID> (basically reinventing the wheel of slash
escaping, but within the perf counter allowed charset).
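A hypothetical sketch of that shim's escaping, in Python: map each disallowed character to a reversible `_<codepoint>_` marker, escaping `_` itself so the mapping stays unambiguous. The exact disallowed set and marker format here are assumptions, not the original scheme.

```python
import re

# Assumed disallowed set for illustration; the real perf counter rules differ.
INVALID = set('()#\\/')

def escape(name):
    """Replace disallowed chars (and '_' itself) with _<codepoint>_ markers."""
    return "".join(
        f"_{ord(c)}_" if (c in INVALID or c == "_") else c for c in name
    )

def unescape(name):
    """Reverse the escaping by decoding each _<codepoint>_ marker."""
    return re.sub(r"_(\d+)_", lambda m: chr(int(m.group(1))), name)
```

Because both the writer and reader go through the same pair of functions, names round-trip cleanly even when they contain the marker characters themselves.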

And to end on an abysmal note: a large project had a VERY consistent naming
scheme, had gotten quite deep file-wise, and was hitting max path length
limitations on Windows. Rather than break the consistent naming for a new,
slightly longer file that needed to be added, or rename everything else, we
changed root paths from Workspace->w, Main->m, Release->r, etc. I am not proud
of this one...

Even as I type this I know there are tons of hacks I'm forgetting (using
plastic knives as hard drive stabilizers in a significantly sized datacenter
deployment) and will gladly expound if there's interest but for now I'll let
this nostalgia get reburied :)

~~~
mdip
I always wondered why the Unit/Record/Group separator characters were
virtually never used. In the case of human editable files, I get it (a comma
is actually _on_ the keyboard, after all). But I'm curious, in your case, why
you went with the snow man over the built-in options[0]? (and I have to admit
that I got a laugh out of the "pattern show[ed] up in production code
elsewhere in the company" \-- I've seen that _so_ many times)

[0] [http://stackoverflow.com/questions/8695118/whats-the-file-
gr...](http://stackoverflow.com/questions/8695118/whats-the-file-group-record-
unit-separator-control-characters-and-its-usage)

~~~
existencebox
An exceedingly stupid act of paranoia: I knew the input _could not_ go above
the normal ASCII character set without errors elsewhere in the pipeline, so it
seemed more robust to choose a separator that, by other invariants, could
never be hit. That said, the group separators you mention might, had I thought
harder about it, have been a more valid answer (but then I wouldn't get to
talk about it as quite so much of a dirty hack!). I imagine they aren't used
much because, frankly, I hadn't even thought about their distinct function
more than two or three times in my entire post-programming life.
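For reference, the "built-in options" mdip mentions are the ASCII separator control characters. A minimal sketch of using the unit separator (0x1F) as a field delimiter:

```python
import csv
import io

# ASCII defines dedicated separators: FS (0x1C), GS (0x1D), RS (0x1E) and
# US (0x1F). Using the unit separator as a delimiter avoids collisions with
# printable content, much like the snowman did.
US = "\x1f"

buf = io.StringIO()
csv.writer(buf, delimiter=US).writerow(["alice", "42", "hello, world"])

# Commas in the data survive because they are just ordinary characters now.
rows = list(csv.reader(io.StringIO(buf.getvalue()), delimiter=US))
assert rows == [["alice", "42", "hello, world"]]
```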

------
gwbas1c
Windows only allows a limited number of Explorer icon overlays installed. If
you install a lot of programs that install Windows icon overlays, some stop
working.

There are ways, though, to make sure that your icons have priority over "Joe's
poorly designed explorer plugin." :)

~~~
mdip
Reminds me of the maximum PATH length issue still present in most versions of
Windows (I think Windows 10 Anniversary resolves it).

It was particularly painful because when you'd hit it (by, say, installing
Sybase drivers or some other awful application that insisted on putting nearly
every subdirectory it had in PATH), nothing would tell you that it was
specifically the PATH being truncated that was at fault, you'd just get a
large number of applications that would stop working and return obscure error
messages.

~~~
rincebrain
Windows 10 Anniversary has code to resolve it, but it's opt-in.

[http://winaero.com/blog/how-to-enable-ntfs-long-paths-in-
win...](http://winaero.com/blog/how-to-enable-ntfs-long-paths-in-windows-10/)

------
CatsoCatsoCatso
Microsoft Excel 2003 (or at least the copy I was stuck with) had a weird bug:
if the final column of a CSV spreadsheet with headers was empty (column header
present but no data), then the output CSV file would only contain the correct
number of commas for affected rows up until the 16th line, after which it
simply omitted the trailing commas that should indicate an empty field at the
end of the table.

This would cause all sorts of errors with the program I had to upload the
files to.

My only work around was a series of Regex based find and replaces in
Notepad++, I could have perhaps scripted something automatic but I was on a
very locked down machine at the time.
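The regex repair amounted to padding short rows back out to the header's width. A hypothetical scripted version (had the machine allowed it) might look like:

```python
import csv
import io

def pad_csv(text):
    """Pad every row to the header row's column count with empty fields."""
    rows = list(csv.reader(io.StringIO(text)))
    width = len(rows[0])  # the header row defines the expected column count
    padded = [row + [""] * (width - len(row)) for row in rows]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(padded)
    return out.getvalue()

# Rows missing their trailing comma get it restored:
assert pad_csv("a,b,c\n1,2\n") == "a,b,c\n1,2,\n"
```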

It was one of many weird MS Office bugs I had on a A3 sheet pinned to my
cubicle wall.

------
danbruc
My favorite is not a workaround for a bug, but for a limitation in the GUI
library used.

I worked on an enterprise job scheduler that was initially outsourced to an
Indian company but the project started failing and so we took back
development. The software was required to be able to schedule tasks with a
delay of up to a hundred or so hours but the GUI library only had a control
for time of day up to 24 hours. The code we received had an interesting
solution - they changed the format string to place the milliseconds part first
and then some code in the data access layer that swapped hours and
milliseconds back and forth on reads and writes. And there you have it, delays
up to 999 hours.
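A rough sketch of that trick (field names assumed; the real swap lived in the data access layer): the control caps hours at 24 but happily holds 0-999 in its milliseconds field, so the two fields are swapped on every write and read.

```python
def to_control(hours, minutes, seconds):
    # The GUI control can't show more than 24 hours, so stash the real hour
    # count (0-999) in the milliseconds field and zero out the hours.
    return {"h": 0, "m": minutes, "s": seconds, "ms": hours}

def from_control(fields):
    # Swap back when reading: the milliseconds field holds the real hours.
    return (fields["ms"], fields["m"], fields["s"])

# The round trip preserves delays far beyond the 24-hour display limit:
assert from_control(to_control(999, 30, 0)) == (999, 30, 0)
```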

------
santialbo
In Windows 10, resizing the command window would break npm
[https://github.com/npm/npm/issues/12887](https://github.com/npm/npm/issues/12887)

The workaround: not resizing the command window...

And response from someone in Microsoft:
[https://github.com/npm/npm/issues/12887#issuecomment-2225253...](https://github.com/npm/npm/issues/12887#issuecomment-222525339)

~~~
Senji
Tried reproducing it with TCC/LE or ConEmu?

------
avh02
At a previous job I was asked to debug a large (inefficient) cronjob that was
suddenly taking 24+ hours instead of the usual ~8 hours. (We had just migrated
infrastructures but noticed this days later)

Being relatively new to that particular codebase, I looked at it and saw
nothing that stood out to me... after an unfruitful day, and not wanting to
get too deep into the code without necessity, I fired up a profiler. Logging
(syslog) statements were taking HUGE swathes of time. Neither I nor the person
supervising me could believe that was it, so we put it on the back burner.

The next day I take another look at the log statements, fire up a python shell
and find the logging statements on that server are returning instantly 4/5
times. Every 5th (or so) time it would block for 5 seconds or more. Given the
cronjob writes thousands of log statements in the course of action, this
became a cause of concern.

I didn't manage to look into it deeply enough (I guess DNS caching plus crappy
DNS), but the quick workaround was to toss the syslog server's address into
the hosts file; the cronjob ran 'smoothly' after that.
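One quick way to confirm that kind of stall is to time the resolver calls directly. A minimal sketch (the 5-second threshold mirrors the stall described above; "localhost" stands in for the syslog server's hostname):

```python
import socket
import time

def resolve_time(hostname):
    """Return how long a single name resolution takes, in seconds."""
    start = time.monotonic()
    socket.gethostbyname(hostname)
    return time.monotonic() - start

# With a hosts-file entry (or a healthy resolver) this returns quickly;
# the flaky DNS server showed multi-second stalls on every few calls.
assert resolve_time("localhost") < 5.0
```

Running this in a loop would have reproduced the "instant 4/5 times, 5-second block otherwise" pattern directly.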

------
atom_enger
I remember working as a help desk tech; our company used ACT, the CRM
software. At the time it was very poorly designed (might still be) and used an
MSSQL database to store all of its information. We wanted to port all of the
information in the DB to a web app that would allow us to do different stuff
with the data that ACT wouldn't allow (number crunching, email reports, etc).
Part of the problem was that the ACT install automated the MSSQL part of the
setup and set the root user (I forget what they call it in MSSQL now) with a
password, so you couldn't see any of the internal tables. I remember spending
that night, after everyone went home, learning how to shut down the database
and force a reset on the root user so that we could add a user with read
access on all the tables.

Everyone had been talking about getting at that data for a year or so and one
night I was just like fuck it, I'll give it my best shot. Honestly it wasn't
that impressive, but I certainly do remember how cool it felt to tell "the
man" to F off and this was our data :).

------
cakes
The best/closest I have: where I once worked, we had a NetApp that allowed
itself to be upgraded to a version it didn't support (it wouldn't boot), which
was not how it was supposed to work... Anyway, we should have been able to
fall back, but the jump we tried to make screwed with the paths used by the
bootstrapping/startup, and while normally the previous version should have
been recoverable... it was not, because of where the upgrade process failed.

So we were trying to recover it and I had a "It's a Unix System, I know
This!"-moment and was able to manually type in the path to the previous binary
during an emergency/rescue prompt (based on deductions from forums, the
current failed loading message, and some obvious things like architecture) and
got it up and going again.

Documented that, internally, to the best of my ability.

------
slm_HN
This is a little different, but I always think about it when someone says bug
workarounds. It's literally a bug workaround from an unknown coder back in the
days of BASIC...

    
    
        390 ...some basic code here...
        395 GOTO 405
        400 REN HOUSEKEEPING
        405 ... more basic code...

(Presumably REN was a typo for REM; rather than fix the comment, the coder
just jumped over the line that wouldn't parse.)

------
mirkules
Most recent one is a bug in lubuntu based on 16.04 where the mouse cursor
disappears after system goes to sleep (but is still functional).

Workaround is ctrl-alt-f7 to switch to console then ctrl-alt-f1 to switch back
to GUI, and the mouse cursor reappears.

[https://bugs.launchpad.net/ubuntu/+bug/1573454](https://bugs.launchpad.net/ubuntu/+bug/1573454)

Another one is a sweet widget in OS X called iStatPro, which was no longer
working as of Mountain Lion. But there is this workaround, which for me still
works on El Capitan: [http://hints.binaryage.com/istat-pro-for-mountain-
lion/](http://hints.binaryage.com/istat-pro-for-mountain-lion/)

------
reacweb
We needed to print a log file on a VMS station, but the end of the file was
never printed (11 pages instead of 17). The file contained many '%'
characters. I suggested replacing them with '#', which solved the issue.

------
malkia
Not entirely the same kind of workaround, but an ingenious way to get game
patching on the PS2 through self-exploitation. From Insomniac:

[http://www.gamasutra.com/view/feature/194772/dirty_game_deve...](http://www.gamasutra.com/view/feature/194772/dirty_game_development_tricks.php)

Also this on their site (but requires flash):
[http://www.insomniacgames.com/self-
exploitation/](http://www.insomniacgames.com/self-exploitation/)

~~~
voltagex_
Unfortunately the swf doesn't seem to be there: Failed to load resource: the
server responded with a status of 404 (Not Found)

[http://web.archive.org/web/20160310003012/http://www.insomni...](http://web.archive.org/web/20160310003012/http://www.insomniacgames.com/self-
exploitation/) has it, though.

~~~
Senji
That's just a powerpoint presentation in flash form.

------
drakonka
Just had one. Not as strange as most of these but annoying. We have a custom
P4V tool which often needs to be run simultaneously for two different
changelists via the changelist context menu. However, we noticed that after
the first instance of the script finishes on the first changelist the second
one running in parallel exits along with the first, never finishing the work
for the second changelist. I noticed that if you terminate the second started
instance the first is unaffected, only the other way around.

At first I thought it was something wrong with handling multiple processes in
our tool, or some weird multiprocess tkinter or cx_Freeze issue. Then I
realized that starting these two instances of the script from two _separate_
p4v windows resolves the issue and they can run at the same time, not
hindering each other. But we can't ask users to have multiple P4V windows open
just to run this on multiple changelists.

The workaround, for now, is having the custom tool run a batch file instead
which then runs the frozen python app exe, ensuring that each actual instance
of the tool starts in its own parent process and not as a p4v subprocess.
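The batch-file indirection effectively re-parents the tool. The same idea can be sketched on POSIX with a new session (the Windows equivalent would be DETACHED_PROCESS/CREATE_NEW_PROCESS_GROUP creation flags; this is an analogue, not the actual tool):

```python
import subprocess
import sys

# Launch the child in its own session so it is not tied to the launcher's
# process group and isn't taken down when a sibling instance exits.
child = subprocess.Popen(
    [sys.executable, "-c", "print('working')"],
    start_new_session=True,
    stdout=subprocess.PIPE,
    text=True,
)
out, _ = child.communicate()
assert out.strip() == "working"
```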

------
MzHN
In a map project, I had markers stored in PostgreSQL + PostGIS database.

As the number of markers got too heavy for the browser, I tried querying only
the markers within a certain range of the coordinate I was visualizing.
For some reason, no matter what coordinate systems, data type casting and
PostGIS functions I tried, I would always get an ellipse-shaped area of
markers, where the north-south distance was twice the expected, and the west-
east distance was as it should be.

As I realized that the issue was consistent, and always exactly double, I
decided on a crazy workaround: I added math to the distance query, to divide
the latitude coordinate by 2 and then order the results by distance and LIMIT
1000 closest markers this way.

Voilà, perfect circle on the map!

Even though the resulting coordinates were completely off, it did not matter,
because only the distance comparison used the wrong coordinates.
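The cancellation is easy to see in miniature. Here is a toy planar distance standing in for the PostGIS one (an assumption about where the doubling lived, purely for illustration):

```python
def buggy_distance(a, b):
    # Stand-in for the misbehaving query: north-south differences come out
    # doubled, east-west differences are correct.
    (lat1, lon1), (lat2, lon2) = a, b
    return ((2 * (lat1 - lat2)) ** 2 + (lon1 - lon2) ** 2) ** 0.5

def workaround_distance(a, b):
    # Halve the latitudes before comparing; the spurious factor of 2 cancels,
    # so ordering by this distance yields a circular search area again.
    (lat1, lon1), (lat2, lon2) = a, b
    return buggy_distance((lat1 / 2, lon1), (lat2 / 2, lon2))

# 4 units north and 4 units east are now equidistant, as they should be:
assert workaround_distance((10, 0), (14, 0)) == workaround_distance((10, 0), (10, 4))
```

Since the query only used the distance for filtering and ordering, the mangled coordinates never leaked into the results.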

------
imron
Not exactly a bug workaround, but in a similar vein:

[http://www.gamasutra.com/view/feature/132500/dirty_coding_tr...](http://www.gamasutra.com/view/feature/132500/dirty_coding_tricks.php?page=4)

Scroll down to 'The Programming Antihero'.

------
shermanyo
Our team was porting our middleware product to an appliance environment
(stripped linux os, hardened image).

We had a config script that we used internally for test environments, and we
were hoping to use it on the box until our own code covered this part of the
setup process.

It relied on starting several services in order, and checking certain things
were running at various points, by parsing the output of 'ps'.

Unfortunately, the appliance used a BusyBox version of 'ps' that truncated the
output.

I ended up writing a shellscript that checked /proc manually and echoed a
string that would match the main offenders, aliased 'ps' to the new script,
ran the setup and it worked first time.

I used it on our nightly test runs for ~ 3 months without issue, until it was
properly replaced.
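The original was a shellscript, but the /proc trick translates directly; here is an approximate Python sketch that reads each process's full command line straight from /proc, sidestepping BusyBox's truncated ps output:

```python
import os

def proc_names():
    """List the full argv[0] of every process by reading /proc directly."""
    if not os.path.isdir("/proc"):  # Linux-only sketch
        return []
    names = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            # /proc/<pid>/cmdline is the untruncated, NUL-separated argv.
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                argv = f.read().split(b"\0")
            if argv and argv[0]:
                names.append(argv[0].decode(errors="replace"))
        except OSError:
            pass  # process exited while we were scanning
    return names
```

Aliasing `ps` to a wrapper that echoes these names in the expected format is all the setup script needed.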

------
shermanyo
For the last couple weeks, I've been doing some work through the following
chain:

- A Windows VM (to isolate VPN connections)

- RDP to a Windows VM (jump box in the cloud network)

- The VMware vSphere client (to perform the initial appliance ISO installation)

The bug I've encountered: the first keypress is echoed several times, while
keys typed immediately after are only sent once. Any short (< 1 second) pause
in input will cause the next keypress to echo several times again.

Leading to input like the following:

login> rrroot

password> pppassword

My workaround to get through the initial configuration (so I could ssh)
involved remembering to press/release shift before I typed anything. (on
screen keyboard also worked, but where's the fun in that?)

It ended up feeling like the habit of tapping esc before entering a command in
vim :P

------
mdip
This isn't a software bug, but since a lot of these aren't, I thought I'd
share because it was a fun one with an unexpected cause.

I worked on a floor with about 10 people that was entirely occupied by phone
switch equipment (raised floor, wires/racks, Halon fire suppression and big
enough to seat several hundred people were it not for the equipment). For two
weeks, about every 3 days or so in the middle of the night, the power would
cut off. This was particularly surprising since the entire floor had dedicated
battery and diesel backup (regularly checked/tested) and they never kicked on.
Our facilities guy was going bald troubleshooting it -- brought in
electricians and had the techs checking _everything_. There was just no
explanation.

In a last-ditch effort to get some information, he set up a laptop with a
built-in webcam and placed it high enough in the air so as to capture most of
the site[0].

A little history is necessary for the facility's design to make sense. At one
point this room was used for our mainframe -- we were a local phone company
and had a _ton_ of data. This necessitated having a very elaborate near-line
storage device custom built for the company. It consisted of a multi-million
dollar robot (the exact kind you see on commercials building CARS, an arm
about the size of an adult man coming out of the floor which ran on a track
from one wall to about the middle of the space). It was enclosed in glass and
would move tapes from a large shelving unit into drives and back but it was an
open loop system: it never truly knew if it got a tape or if the tape made it
to the drive and back and being an imperfect mechanical device, every once in
a while it dropped a tape and someone would have to disable it, go in and pick
the tape up off the floor (or, more often, the pieces of what was once a tape
in some cases). This robot moved _very fast_ and was _very powerful_ so in a
scenario where it's a person vs. "big moving robot"... well, there'd be pieces
of person on the floor instead of tape. Since we liked our employees (and OSHA
probably mandated it), the interior of the robot housing was filled with
exposed "big red buttons" that would cut the power in an emergency. The
exterior walls of the switch room had the same switches, though these buttons
had a large acrylic cover with a hole in it so that you couldn't
_accidentally_ power down anything. A choice few of these killed power to the
entire site and had a sign indicating that with something along the lines of
"OH PLEASE GOD DON'T TOUCH THIS BUTTON"

Janitorial staff had been used to turning the lights out on their way out if
they were left on and a new member of janitorial staff discovered, at some
point, that hitting that big red button took care of all of the lights at once
(along with all of the normally blinking LEDs on the thousands of switch
cards, but hey -- it got dark at least!). So on his way out the door, he'd
walk over to it, look at it for a second, then push it ... powering down ...
everything.

The workaround was easy. We were now responsible for taking care of our
garbage, dusting and cleaning from that point forward (which I think during my
7 or so years on that floor happened _once_ ) and a permanent camera was
installed in the ceiling which was powered on a circuit not affected by the
buttons. The buttons remained, though.

[0] I think after ruling out everything else he had suspected sabotage of some
kind was responsible. Our doors used RFID badges and visitor logs were
accessible, but at that time the doors that were interior to our office space
didn't require badge access and there were no entries for the doors that one
would have checked.

~~~
Piskvorrr
First thing that came to my mind - check the logs. What, no access logs for
critical infrastructure, no physical access control, "anybody could use the
door, no biggie"? I had a hunch about your issue from sentence 3 onwards - I
thought the story "janitor unplugs server, plugs in vacuum, replugs server
when done" was universally known. Apparently, "those who don't know history
are doomed to repeat it." ;)

~~~
mdip
Yeah, that was the painful part. Almost nobody had access to that entire suite
and those that did underwent stringent background checks and were very
technical, so physical security once you were in the suite was limited.

IIRC, I believe it was discovered that the janitorial staff used building keys
rather than the RFID locks so they weren't even logged when they arrived in
suite.

I was a little surprised that hitting the emergency button didn't trigger an
alarm of some kind, but that's how it was installed in the 80s and I'm fairly
certain it's still that way, today (though I don't work there any longer).

Outside of those omissions, things really were kept in order: monthly battery
tests, quarterly diesel/full system and disaster recovery tests. It's right
when you think you have a solid process that someone comes along and pushes
the wrong button, or burns some toast and triggers a floor
evacuation/unexpected Halon test (that happened, too -- at some point they
took away all of our nice things).

------
sethammons
Not really a bug, but I just ran into this: a linter for Ruby that only allows
double quotes when there is string interpolation, and blocks builds when the
rule is violated. Never mind if you want double quotes to avoid escaping
single quotes for readability. Here is a workaround ;)

    
    
        fuck_linters = ''
        linted_string = "${fuck_linters}don't stop apostrophes"

~~~
lloeki
> A linter for Ruby that only wants double quotes if there is string
> interpolation and prevents builds from triggering

Is that Rubocop? Put this in `.rubocop.yml`:

        Style/StringLiterals:
          EnforcedStyle: double_quotes

More here:
[https://github.com/bbatsov/rubocop/blob/master/config/defaul...](https://github.com/bbatsov/rubocop/blob/master/config/default.yml)

~~~
sethammons
It is Tailor. Thanks for the info.

------
deadlyllama
In the late 2000s I worked for a small NZ company, Innaworks, who developed a
tool to automatically port J2ME mobile phone apps (mostly games) to BREW,
Qualcomm's C++ environment for phones.

The number of handset bugs we had to work around was immense. One handset, the
Samsung A790, would reboot if you drew text on an offscreen buffer. Another,
the Samsung N330 which we nicknamed the "shaver phone" for obvious reasons[1],
ignored a few least significant bits of the source x coordinate when you did a
bitblt from an offscreen bitmap to the screen, IF the offscreen bitmap had
fewer than 4 bits per pixel.

We ended up writing our own graphics code that wrote into the BREW backbuffer,
set the damage rectangle, and asked BREW to blit that to the screen for us.
This was much faster than the BREW runtime's graphics code, so games ported
via our automated system often ran faster than "hand-ported" games.
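
The backbuffer approach is essentially a plain software blit done in your own
code, so the buggy handset blitters never see an awkward source rectangle.
Here's a minimal sketch, assuming a hypothetical 16-bits-per-pixel layout and
my own names for the buffers and strides (BREW's actual IDisplay/IBitmap
interfaces are not shown):

```cpp
#include <cstdint>
#include <cstring>

// Copy a w x h rectangle of 16bpp pixels from an offscreen buffer into a
// backbuffer, one row at a time. Strides are measured in pixels, not bytes.
void blit_rect(std::uint16_t* dst, int dst_stride,
               const std::uint16_t* src, int src_stride,
               int sx, int sy, int dx, int dy, int w, int h) {
    for (int row = 0; row < h; ++row) {
        std::memcpy(dst + (dy + row) * dst_stride + dx,   // dest row start
                    src + (sy + row) * src_stride + sx,   // source row start
                    w * sizeof(std::uint16_t));           // one row of pixels
    }
}
```

After filling the backbuffer like this, you'd mark the damage rectangle and
let the runtime do a single full-width screen update, sidestepping the
fractional-x blit bug entirely.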

The LG AX260 would crash with an error screen if you used threading -- I
suspect an ISR would notice the stack pointer was in the heap and halt the
phone. This was a BREW 3 phone, and BREW 3 actually had a threading API, so we
thought maybe the solution was to use the real threading API instead of
setjmp/longjmp. No, BREW 3 threads froze the phone too. We worked around the
problem with some help from memcpy and some rather evil stack pointer
manipulation. Our stacks were pretty small as all Java objects were allocated
on the heap, so this wasn't as bad a performance issue as you'd think. I
refactored the scheduler to avoid stack copies if it decided to keep running
the current thread.
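
For a feel of what user-space cooperative threading looks like, here is a
sketch using POSIX ucontext as a stand-in for the setjmp/longjmp and
stack-copying tricks described above (this is my illustration, not Innaworks'
actual implementation): each "thread" owns its own stack and yields back to
the scheduler explicitly.

```cpp
#include <ucontext.h>

static ucontext_t sched_ctx, worker_ctx;
static char worker_stack[64 * 1024];  // each green thread owns its own stack
static int steps = 0;

// A cooperative "thread": does one slice of work, then yields.
static void worker() {
    for (int i = 0; i < 3; ++i) {
        ++steps;                               // one slice of work
        swapcontext(&worker_ctx, &sched_ctx);  // yield back to the scheduler
    }
}

// Minimal scheduler: runs the worker for three slices.
int run_demo() {
    steps = 0;
    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof worker_stack;
    worker_ctx.uc_link = &sched_ctx;           // resume here if worker() returns
    makecontext(&worker_ctx, worker, 0);

    for (int i = 0; i < 3; ++i)
        swapcontext(&sched_ctx, &worker_ctx);  // run the thread one slice
    return steps;
}
```

The attraction of the memcpy approach over dedicated stacks is memory: with
tiny stacks (everything Java-like lives on the heap), copying the live stack
in and out of a small save area costs little and avoids reserving a full
stack per thread.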

The worst bug I remember, though, was in the ARM RealView C++ compiler. It
optimized out a null pointer check -- we could log the pointer value, and log
from the exception-throwing code ... which never ran. I eventually got the
compiler to generate an assembly listing for the function in question and
discovered that no null pointer check was there at all. One volatile keyword
later and we were back in business.
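
The pattern, reconstructed from memory as a hypothetical example: once a
pointer has been dereferenced, the compiler is entitled to assume it is
non-null (a null dereference is undefined behavior) and delete a later check
as dead code.

```cpp
// Hypothetical reconstruction of the bug pattern, not the original code.
// Dereferencing p before testing it lets the optimizer assume p != nullptr,
// so the check below may be deleted as dead code.
int first_unsafe(const int* p) {
    int v = *p;                   // dereference happens first...
    if (p == nullptr) return -1;  // ...so this check can be optimized out
    return v;
}

// Checking before any dereference keeps the test. In the story, a single
// volatile qualifier on the access had the same effect in practice, by
// forcing the compiler to perform the load exactly as written.
int first_safe(const int* p) {
    if (p == nullptr) return -1;
    return *p;
}
```

Modern compilers do the same thing (GCC's `-fdelete-null-pointer-checks` is
this optimization by name), so the only portable fix is to never touch the
pointer before the check.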

Our customers loved the product because it just worked. We supported full Java
semantics, all the way down to static initializer ordering. It was a simple
choice to make -- the more robust our system was, the fewer support incidents
for us and the happier the customers. We produced human-readable C++ code so
you could run your app in a debugger if need be, and did some clever whole-
program optimization. Our runtime was a real memory miser: a "400k" Java
handset would have 400k of heap -- code and images tended to live outside
that. We could compile a game for a 400k Java handset to run on a 400k BREW
handset, with that 400k covering our runtime, the user's code, the heap,
image data, audio data... I vividly remember the time I saved a whole
kilobyte of RAM -- that was a major win.

I worked with the smartest people of my career at that company. I've never
been in another environment where everyone was just brimming over with
technical adeptness. And we weren't just a company of young things; there
were a few over-40s there too.

[1] [http://www.cnet.com/au/products/samsung-sch-n330-verizon-
wir...](http://www.cnet.com/au/products/samsung-sch-n330-verizon-
wireless/review/2/)

------
mdip
The most painful bug I encountered had to do with a visitor access kiosk I had
designed and written the software for at my previous company. The workaround
was to block access to a set of sites _for the entire company_ to keep the two
kiosks from failing.

Every few months the web cam would just ... randomly stop working. This would
cause the kiosk application to crash while attempting to take the visitor's
badge photo, rebooting the machine. Because of the nature of the device[0],
it was very difficult to identify the root cause, and the fix was to
physically visit the kiosk, unplug the web cam, remove the driver, install
the latest driver and plug it back in. Eventually, I took some time, set up
one in my office and watched it.

Something odd about the web cam was that the driver would _never_ work if the
web cam was plugged in while the driver installation ran. The installer
instructed _clearly, on a separate page_ to unplug the camera before
"proceeding", and in what I have come to believe is one of the _dumbest_
designs for driver software, it would periodically look for updates over the
internet and _silently install them_, yielding a completely broken web cam. I
spent about a month diagnosing the problem, mainly because that wasn't where
I expected it to be, since I had other, more likely targets[1] (and I hadn't
handled the OS install/driver setup).

To make matters more entertaining, the guy who maintained the hardware had
added the update servers' hostnames to the hosts file, pointing them at
127.0.0.1, but the driver service helpfully ignored that file (as far as I
was told[2]), and turning on the corporate firewall (Symantec Endpoint
Protection) caused blue screens. Since this driver was starting to feel a lot
like malware, we ended up attacking it as such and shut down all
communication with the update servers and IPs at the perimeter ... for the
whole company[3].

[0] It was an Office Communications Server solution written against a _very_
old API, and the kiosk ran Windows XP, which we stripped of nearly
everything, forcing the device to use the kiosk application as its shell
(which would restart itself if it encountered any problem).

[1] I had to write a component for the software to work with the web cam in
C++, a language I hadn't touched in years, so my gut feeling was that it was
related to that component.

[2] It could be that we missed some of the IP addresses it polled, or it could
be that it just ignored the hosts file in windows. I didn't do this work so
I'm not entirely sure.

[3] For whatever reason, security wouldn't/couldn't block the IPs just for
the kiosk itself (something about it being set up to not require
authentication to access the internet, and our perimeter proxy server -- at
the time -- being unable to block specific external IPs for specific internal
IPs. My bet is that it was more a "not willing to" than an "unable to", but
who knows?). The practical upshot is that we had some of these devices on
peoples' desks within the company, and they experienced the same problem, so
once it was banned, we received fewer help desk calls for broken web cams.

~~~
LgWoodenBadger
Why couldn't you have just gotten different web cams?

~~~
mdip
Sounded like a logical thing to do, and we thought of that as well. There were
really two reasons: The "dumb" one was "Corporate Standards" which always
seemed to serve as a method to ensure the worst possible product was forced
upon everyone. I could have worked around that with a bit of political effort.

The bigger reason, though, was that the code was written targeting specific
vendor APIs. Other cameras simply weren't compatible with that code and it
would have been a bigger nightmare working that out, unfortunately.

------
hga
Search for SimCity in this item:
[http://www.joelonsoftware.com/articles/APIWar.html](http://www.joelonsoftware.com/articles/APIWar.html)

It's kinda strange for an OS to be maintained for a long time with that style
of backwards compatibility....

~~~
angersock
Only if by "strange" you mean "fucking lucrative".

A lot of folks are used to the crazy ship-all-the-time-regardless-of-cost
world of webdev, but there is a lot of business value in not breaking things
randomly.

