
Ask HN: What's the hardest problem you've ever solved? - levlandau
We've all toiled and eventually solved some problem. Would be awesome to see some of the toughest problems HN community members have solved. Could be from any domain though it's better if the problem is somewhat technical.
======
onion2k
This is exceptionally hard to answer because for every problem I solve I tend
to end up looking at my solution and thinking "That wasn't so hard. Why did it
take me so long? Am I bad at this stuff?"

To answer the question though, I think probably writing a robust web scraper
to search events listings and turn them into a shareable calendar. It'd be
trivial these days but I did it in 1999 in Perl with regexes.

~~~
zallen
> "That wasn't so hard. Why did it take me so long? Am I bad at this stuff?"

Hah. Always. Hindsight bias and impostor syndrome are a fun mix! I remember
writing a blog suite (with comments!) in Perl in the late 90s; back then,
without S.O. and other knowledge-sharing beyond some Usenet forums, inventing
the wheels as we went along... it was _all_ hard.

~~~
stevoski
I coded the same thing (actually an online magazine with comments on articles,
a forum, and a form for signing up for email updates), around the same time. I
used classic ASP. I shudder to think of all the security holes I must have
had.

------
vatys
Debugging a "hot CPU" boot failure issue. A custom motherboard design would
only boot when the CPU was cold, like ICE cold (put it in a freezer, or hit it
with cold spray).

Turns out the bias current node (external RBIAS resistor sets bias current)
for PCIe was routed too close to an inductor for a power rail. When the CPU
was warm, the power rail pulled more current, causing the inductor to ring
more, causing the crosstalk on the bias net to screw up the PCIe subsystem and
hang the CPU.

Found the issue accidentally on a layout change. Had to prove it by drilling
out the via and re-routing the signal with wire.

------
paraschopra
In my initial years of programming, my first hardest problem was
implementing backpropagation in VB 6.0
[http://paraschopra.com/tutorials/nn/](http://paraschopra.com/tutorials/nn/)
(2003)
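
A minimal numpy sketch of what that tutorial covers (not the VB 6.0 code, and all names are mine): a tiny network trained on XOR by backpropagation, with every gradient written out by hand.

```python
import numpy as np

# A 2-4-1 network trained on XOR with plain full-batch gradient descent.
rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4))   # hidden layer weights (4 hidden units)
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss():
    h = sigmoid(X @ W1 + b1)
    return float(np.mean((sigmoid(h @ W2 + b2) - y) ** 2))

initial = loss()
for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: chain rule, layer by layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient step (learning rate 1.0)
    W2 -= h.T @ d_out
    b2 -= d_out.sum(axis=0)
    W1 -= X.T @ d_h
    b1 -= d_h.sum(axis=0)
final = loss()
```

The hard part back then (as now) was getting those four gradient lines right by hand.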

Then the next hardest problem I found was implementing Genetic Programming in
Python (year 2005)
[http://paraschopra.com/sourcecode/GP/index.php](http://paraschopra.com/sourcecode/GP/index.php)

It was fun but extremely hard for me (at that age!).

After that, in 2008, I think the trickiest part for me was writing the
initial visual editor for Visual Website Optimizer. It involved a reverse
proxy and inserting JavaScript code into the reverse-proxied page, letting the
user visually edit the page contents.

Fun days. These days I hardly get to code, though last year I gifted my wife a
website ([http://wowsig.com](http://wowsig.com)) which was super fun.

------
doiwin
Turning myself from a data-driven nerd into an empathetic person who
understands social interactions. Finally, I bring home girls :)

~~~
hueving
You can be empathetic and remain empirical. It's the difference between
thinking nobody is wrong vs knowing someone is wrong but it not mattering in
certain contexts.

~~~
doiwin
Yes, that is part of it. I pretty much abandoned the thought of someone being
wrong and someone being right. We all have this bias that we are more
intelligent than others. While on average, we are just average.

~~~
proveanegative
>I pretty much abandoned the thought of someone being wrong and someone being
right.

Frankly, this sounds like taking a good idea too far. People around you will
make assertions about objective reality that are beyond any reasonable doubt
incorrect, sometimes dangerously so. Considering them wrong can be important
to prevent harm to yourself, your family and your job or business. Beyond
that, calling out people you care about on their being wrong can prevent harm
to them and their respective families and jobs even if at first they hate you
for it.

Empathizing with people who are wrong for understandable reasons but still
being keenly aware that they are wrong seems to me like a much better long-
term strategy.

------
pbiggar
Hardest technical problem would have to be my PhD. I worked out how to apply
alias analysis to PHP (previously people had figured out how to do it for the
easy subset of PHP - I extended it to the entire language of PHP 5.3 or so).
[1]

Along the way I'm pretty sure I also figured out how to build SSA form such
that you have your alias analysis results available to be used at SSA
construction, and therefore avoid redundant computation [2]. I never got to
chase that down but it was really interesting.

[1] [http://paulbiggar.com/research/#phd-
dissertation](http://paulbiggar.com/research/#phd-dissertation), esp chapter
6.

[2]
[http://paulbiggar.com/research/#fit-2009](http://paulbiggar.com/research/#fit-2009)

~~~
kzisme
Interesting read! Thanks!

------
_chendo_
Going to list a few interesting problems that were easy to solve but the
identification of the issue was hard.

* A test suite we wrote for a client's project before a massive refactor was stalling randomly, but would continue when you tried to diagnose the problem. Turns out their user creation code used /dev/random, and the system was running out of entropy and so the code was blocking. Moving the mouse or typing on the keyboard would add entropy, thus causing the tests to resume. Fix was to use /dev/urandom for tests.
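
The failure mode, sketched in Python (names hypothetical): `/dev/random` historically blocked when the kernel's entropy estimate ran dry, while `/dev/urandom` draws from the same CSPRNG and never blocks once seeded.

```python
# Sketch of the fix. The original code read the blocking device; on an
# idle headless box the kernel's entropy estimate runs dry and read()
# stalls until input events (mouse, keyboard) top it up.
BLOCKING_DEV = "/dev/random"      # could stall when entropy estimate was low
NONBLOCKING_DEV = "/dev/urandom"  # same CSPRNG, never blocks once seeded

def make_user_token(nbytes=16, dev=NONBLOCKING_DEV):
    """Random bytes for a test user; defaults to the non-blocking device.
    (os.urandom(nbytes) is the portable equivalent.)"""
    with open(dev, "rb") as f:
        return f.read(nbytes)

token = make_user_token()
```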

* Found a weird issue with an embedded network stack where a limited broadcast packet sent to more than 3 devices would get a response from only a few of them, but directed packets to each device would work fine. Devices reported successfully receiving and transmitting when monitored over a serial console. The issue turned out to be a bug in the ARP implementation where it would incorrectly store any ARP response it saw (rather than only responses to the device's own requests). Given the embedded system had a limited ARP cache due to memory constraints, when multiple devices wanted to respond they would all send ARP requests, and the responses would flush the ARP cache, so when the network stack wanted to send the response it didn't know what MAC to use and just dropped it on the floor. A workaround was to increase the ARP cache size.
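
A toy simulation of the bug (all names invented): with FIFO eviction and a cache smaller than the number of responders, the overheard replies always push out the one entry that was actually needed.

```python
from collections import OrderedDict

class Device:
    """Device with a tiny ARP cache that (incorrectly) caches every
    ARP reply it overhears, not just replies to its own requests."""
    def __init__(self, cache_size):
        self.arp = OrderedDict()          # ip -> mac, FIFO eviction
        self.cache_size = cache_size

    def learn(self, ip, mac):
        self.arp[ip] = mac
        if len(self.arp) > self.cache_size:
            self.arp.popitem(last=False)  # evict the oldest entry

    def can_send_to(self, ip):
        return ip in self.arp

def broadcast_scenario(n_devices, cache_size):
    """Requester broadcasts; every device replies and overhears the
    others' replies before sending. Returns how many responses would
    actually make it onto the wire."""
    delivered = 0
    for i in range(n_devices):
        dev = Device(cache_size)
        dev.learn("requester", "aa:bb")   # resolve the requester first
        for j in range(n_devices):        # overhear everyone else's replies
            if j != i:
                dev.learn(f"dev{j}", f"mac{j}")
        if dev.can_send_to("requester"):
            delivered += 1
    return delivered
```

With 5 responders and a 3-entry cache every response is dropped; a bigger cache (the workaround) delivers all of them.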

~~~
TeMPOraL
> _Turns out their user creation code used /dev/random, and the system was
> running out of entropy and so the code was blocking. Moving the mouse or
> typing on the keyboard would add entropy, thus cause the tests to resume._

Funny how this goes completely against the typical operant conditioning a user
undergoes when working with computers. Usually if your software hangs up, you
want to touch nothing and let it finish. But in this case it's actually
additional user activity that's needed.

~~~
Tiksi
It seems perfectly in line with what I generally see. A hang up usually
results in some desperately mashed combination of Esc, Space, and Enter, then
clicking on absolutely everything, and finally mashing ctrl-alt-del in the
hopes of something happening. The let it do its thing and wait crowd has
always been on the higher end of the technical knowledge spectrum.

------
odabaxok
Debugging can be hard sometimes, too. Here is a Quora topic about it:

[http://www.quora.com/Whats-the-hardest-bug-youve-
debugged](http://www.quora.com/Whats-the-hardest-bug-youve-debugged)

My favourite answers:

Crash Bandicoot: [http://www.quora.com/Whats-the-hardest-bug-youve-
debugged/an...](http://www.quora.com/Whats-the-hardest-bug-youve-
debugged/answer/Dave-Baggett)

Flash Player: [http://www.quora.com/Whats-the-hardest-bug-youve-
debugged/an...](http://www.quora.com/Whats-the-hardest-bug-youve-
debugged/answer/Amir-Memon-1)

500-miles email:
[http://www.ibiblio.org/harris/500milemail.html](http://www.ibiblio.org/harris/500milemail.html)

~~~
abrookewood
500-miles email is a fantastic read - thoroughly recommended.

------
kabouseng
An embedded CPU would sometimes latch up when going to sleep. This wasn't
seen during development, only reported in the field with very little in the
error report, just something like "It stops working; reboot it and it works
again."

Eventually we found the internal JTAG pull-up resistances would not be
sufficient at certain voltages / temperatures. So it wasn't latch-up in the
end; the JTAG would halt the processor.

We only found it after days of testing in an oven cycling temperature while
stimulating a coil (RF field) close to the device while varying the supply
voltage to cause the condition.

All the while the client was not happy that his devices randomly stopped
working, so we were under quite a bit of pressure.

And of course we only started looking at the hardware after we spent quite a
bit of time thinking it was a software bug somewhere.

------
rcaught
I remember a defining struggle in the late 90's, while I was in high school
and teaching myself how to make webpages.

My problem was my JavaScript logic not firing, and the answer was to wait for
the document to load. Simple, I know, but I had nobody around me (physically)
who could help me, and explaining the problem to people in forums seemed
impossibly abstract, primarily because I did not understand what the problem
was. It was the context around my code that I had to fix, but I kept looking
in the code itself.

That was probably one of my first big "ah ha!" moments and these moments are
one of the reasons I still love programming. Tenacity, luck and skill became
irrevocably connected that day.

I've solved many other problems over the years, far tougher than this one,
but maybe never tougher for me on a relative scale. If I had never solved that
problem, I sometimes believe my life path would have been totally different.

~~~
c22
Limited access to informational resources can greatly hinder solution time.
Some of my hardest challenges were implementing content for a MUD with a
sparsely documented custom scripting language. Even with input from other devs
there was a lot of trial and error and hacky workarounds.

Nowadays most of the problems I face have been solved by someone else in a
slightly different context and searching for/implementing existing solutions
is almost trivial.

------
Joeri
One of my first professional challenges was porting a CAD viewer to flash. The
hardest part was figuring out how to convert ellipse sections and AutoCAD
bulge arcs (line + bulge factor) to quadratic bezier curves. That one took
three weeks of figuring out the math (starting from near zero because i hadn't
paid much attention during school). I only completed the task through sheer
stubbornness, because there were whole days where I made no progress at all.

~~~
annnnd
Genuinely curious: I thought you couldn't represent ellipses with bezier
curves, at least not exactly... What did you do?

~~~
Joeri
Cut it into sections smaller than pi/2, then approximated. Looked close enough
to the naked eye that nobody complained.
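
That subdivision trick can be sketched as follows (my reconstruction, not the original Flash code): for an arc section of angle θ < π/2, the quadratic's control point is the intersection of the end tangents, and the radial error shrinks rapidly as sections get smaller (about 0.3% for a 45° section).

```python
import math

def arc_to_quadratic(theta):
    """One quadratic Bezier approximating a unit-circle arc of angle
    `theta`, centred on the x-axis. The control point sits where the
    two end tangents meet: (1/cos(theta/2), 0)."""
    a = theta / 2.0
    p0 = (math.cos(a), -math.sin(a))
    p1 = (1.0 / math.cos(a), 0.0)   # tangent intersection
    p2 = (math.cos(a), math.sin(a))
    return p0, p1, p2

def bezier_point(p0, p1, p2, t):
    """Evaluate the quadratic Bezier at parameter t."""
    u = 1.0 - t
    return (u*u*p0[0] + 2*u*t*p1[0] + t*t*p2[0],
            u*u*p0[1] + 2*u*t*p1[1] + t*t*p2[1])

# Worst-case radial error is at t = 0.5; check it for a 45-degree section.
p0, p1, p2 = arc_to_quadratic(math.pi / 4)
mid = bezier_point(p0, p1, p2, 0.5)
err = abs(math.hypot(*mid) - 1.0)
```

An ellipse is then just this circle construction scaled along one axis.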

------
bitshaker
I am part of a team that created a system that took the complexity of the
human metabolism and reduced it down to a few vital statistics that are then
used to create an individualized formula that tells people how to sustainably
improve their metabolism, lose weight and gain muscle. The amazing part is
that it works for everyone because it is custom tailored to them. The hard
part here was 30 years of testing, thankfully not done by me, but by the CEO,
who happens to be a bodybuilder.

The formula is how much protein, carbs, and fat to eat and the appropriate
exercise of 3 half-hour workout sessions a week. No supplements or anything
else. Just food and small amounts of exercise to stimulate hormone response.
This is way more complex than some tracker or calorie counter. It takes into
account insulin spikes, metabolic damage assessments, glycogen storage, and
much more. The hard part here was integrating ~10 different disciplines in
various sciences. Everyone had a piece of the puzzle, but we had to put it
together.

That is then fed into an app that can then pick foods for you based on your
formula that is then constantly refined based on your results. We took 1000
people through test runs tweaking our code to get it right. Now it works for
everyone that we put on it and actually uses the system.

Our next challenge is the psychology and habit forming parts of the app we
have built.

Oh, and of course competing with well-funded competitors in the space, but at
least nobody can claim our results because they just track things instead of
allowing people to really plan for health.

Edit: Since you asked, it's called mPact (for metabolism impact) and the
corporate site is at [http://mPact.io](http://mPact.io)

~~~
tomjen3
For habit forming, look to Beeminder - its interface is clunky and its
terminology too geeky, but it is the best way I've found to turn "some day"
(some day I want to lose weight) into today and keep it up.

~~~
bitshaker
Thanks. That's one avenue we've looked at already. I have also looked at
traditional gamification techniques and found them initially motivating for
users, but then engagement falls off a cliff. This has less to do with the
techniques I suspect and more to do with the fact that people eventually hit a
goal weight and think they are suddenly "fixed" and can go back to what they
were doing before. We actively work to encourage the better mindset of making
a permanent change in lifestyle where our system then moves from educational
and informative to simply a tool to continue to plan and keep on track.

------
lqdc13
I think problems are hard when you are new to a domain. After you get some
practice, nothing there is really that hard.

For me, the first hard thing was implementing this
[https://en.wikipedia.org/wiki/Dead-
end_elimination#Generaliz...](https://en.wikipedia.org/wiki/Dead-
end_elimination#Generalizations)

Probably because it was the first algorithm I implemented with no reference
implementation to look at.

The second hardest was a high performance proxy that can redirect to another
proxy and can collect specific types of non-encrypted data.

------
johnbender
NP-completeness for automatic fence placement between two specified
instructions in the presence of arbitrary goto statements. Reduction is from
negation free 2-SAT to control flow graphs for real programs.

Didn't make it into my first paper, hopefully will end up in my thesis :)

------
JensRantil
I used to work at a VOIP provider where users started reporting choppy audio.
After a week or two we nailed it down to customers that had "call recording"
feature enabled. Essentially their calls were being recorded and streamed to
an audio file to be accessed later through a web interface. After yet another
week of investigation we noticed that disk IO was fairly high on machines
that had big customers with call recording enabled for all their endpoints. We
drilled down into the IO issue to the WAV file format, which has a header that
needs to be updated on every write to accommodate the updated length of the
recording. This required a lot of disk seeks on spinning disks, and
unfortunately file flushing could not be disabled. Switching to a RAW audio
format that we post-processed after the call resolved the issue.
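
For the curious, here's the shape of the problem (a sketch, not the original VOIP code): the canonical 44-byte PCM WAV header stores the RIFF chunk size at offset 4 and the data chunk size at offset 40, so keeping the file valid after every append means seeking back to patch both fields.

```python
import io
import struct

def wav_header(sample_rate=8000, channels=1, bits=16):
    """Canonical 44-byte PCM WAV header with zero-length data chunk."""
    block_align = channels * bits // 8
    return (b"RIFF" + struct.pack("<I", 36)        # chunk size, no data yet
            + b"WAVEfmt " + struct.pack("<IHHIIHH", 16, 1, channels,
                                        sample_rate,
                                        sample_rate * block_align,
                                        block_align, bits)
            + b"data" + struct.pack("<I", 0))      # data size, no data yet

def append_samples(f, payload):
    """Append audio, then seek back twice to keep the header consistent -
    the two backward seeks per write are what hammered the spinning disks."""
    f.seek(0, io.SEEK_END)
    f.write(payload)
    total = f.tell()
    f.seek(4); f.write(struct.pack("<I", total - 8))    # patch RIFF size
    f.seek(40); f.write(struct.pack("<I", total - 44))  # patch data size

f = io.BytesIO(wav_header())
append_samples(f, b"\x00" * 100)
append_samples(f, b"\x00" * 100)
```

A headerless RAW format has no such back-references, so writes become pure appends.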

------
lunixbochs
Running OpenGL on mobile (OpenGL ES) devices. From inside an x86 emulator.

[1]
[https://github.com/lunixbochs/glshim](https://github.com/lunixbochs/glshim)

[2] [https://youtu.be/8ibx-2ZBLVg?t=76](https://youtu.be/8ibx-2ZBLVg?t=76)

------
davidst
Built the head tracker for the Amazon Fire phone.

~~~
bshimmin
Was that challenging from a technical perspective, or from a "I can't quite
believe I'm doing this" perspective?

~~~
davidst
It was challenging in every way imaginable. There was no existing algorithm
that could deliver the accuracy and robustness we required. It had to run
within the limited power budget of a phone. And it had to be done quickly
before the hardware became uncompetitive.

At the time it was given to me it was a rough demo with no clear path forward
to shipping. We had no metrics to tell how good it was, how good it had to be,
or whether we were even making progress. We had no team of computer vision
experts to work on core algorithms. We had no idea if the problem was solvable
at any amount of power consumption. There were more than a few people within
the company who thought it couldn't be done.

I want to be very clear about credit. I put this as the hardest thing I have
ever done but I was only the manager in charge of the project. While I built
the team and owned the problem, I did not write the code or design the
algorithms. I had incredible people who did outstanding engineering work and
researchers who advanced the boundaries of computer vision. It was a privilege
to work with them and I am proud of them.

------
dave31415
I recently wrote a time series modeling algorithm. I tried some existing open-
source packages but none worked very well. I really just wanted to decompose
the time series into a set of linear trends merged together in a continuous
way. It turned out there was an elegant algorithm to do this called L1TF, from
Boyd's convex optimization group at Stanford. I also found a Python
implementation on GitHub to get started with. The paper mentioned that it was
easy to add all kinds of things such as seasonality, discontinuities, outlier
rejection, auto-regression etc. but didn't give formulas. It just waved the
hand like many academic papers do. I ended up figuring out how to add all
these things, but in order to do so I had to learn a large part of the field
of convex optimization in my after-work and vacation time and perform some
lengthy, difficult calculus to arrive at the formulas. The algorithm worked
great in the end. I find it funny that while the client is satisfied, they
have no idea that they now possess one of the world's most powerful time-
series algorithms, which involves ideas from some of history's greatest
mathematicians: Newton, Lagrange, Euler, von Neumann, as well as many of the
past century's luminaries. The open source part is here.
[https://github.com/dave31415/myl1tf](https://github.com/dave31415/myl1tf)
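
For reference, the objective L1TF minimizes can be written down in a few lines of numpy (this shows the idea, not the linked implementation): the l1 penalty on second differences is zero on linear segments, so minimizing it yields piecewise-linear trends with sparse kinks.

```python
import numpy as np

def second_difference(n):
    """The (n-2) x n second-difference operator D with rows [1, -2, 1]."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def l1tf_objective(x, y, lam):
    """L1 trend filtering objective: (1/2)||y - x||^2 + lam * ||D x||_1."""
    D = second_difference(len(y))
    return 0.5 * np.sum((y - x) ** 2) + lam * np.sum(np.abs(D @ x))

# A perfectly linear series has zero penalty; a single slope change
# contributes exactly one nonzero second difference (one "kink").
n = 10
idx = np.arange(n)
linear = idx.astype(float)
kinked = np.where(idx <= 5, idx, 5 + 2 * (idx - 5)).astype(float)
D = second_difference(n)
```

Minimizing this objective (e.g. with a generic convex solver) trades fit against the number and size of kinks via `lam`.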

------
kschua
Poltergeist Room problem.

Back in the CRT monitor days, I was working for a computer repair company.
There was this particular client (in the defence industry) who had monitors
that started flickering and having a greenish hue at its sides after a week.

Every week, we had to go to his office to swap the monitors and bring the
faulty ones back to recalibrate (it was costly, but hey, its a Defence
contract and those pay big bucks)

It didn't matter whether the monitor was brand new or a recalibrated one; it
just started flickering and had a greenish hue after a week, and it only
happened in that room. Other monitors outside that room and on other levels
were fine, thus the room was dubbed the Poltergeist Room (as they blamed
spirits for messing with it).

One day after the monitor exchange, I returned to the office and my supervisor
queried me as to why I didn't reply to his multiple pages (we were using
pagers back then). I realised I was in the Poltergeist Room when the pages
were sent and therefore did not receive any page. It then dawned on me:
"Could it be some electromagnetic interference from another level directly
above or below that was playing havoc?"

I went back to the client the next day to tell him what I thought and he
(being electronics trained) realised that above him was a defence lab carrying
out EMF experiments, which could have caused the monitor problems. He got to
work to build a simple Faraday cage to prevent EMF from getting to the
monitor. Since then, the monitors worked perfectly.

~~~
kranner
No 'degauss' button on those CRTs? :)

------
netik
Trying to figure out how to scale and secure Twitter. From a dozen people in a
room to 2800 when I left, it was a challenge every day.

~~~
kzisme
How did you like your Twitter experience?

------
lifthrasiir
This is not the most difficult bug I've ever encountered, but it is definitely
one of the most interesting bugs.

I had encountered some seriously incorrect outputs from the application
server. The output in question was a function of internal states and current
time (rounded to hours, it was kind of "hourly" display). The application
server was set to log many input/output pairs, so I was able to identify a
non-trivial number of such errors, but I was unable to determine the cause.
Common causes like memory corruption, time zones (as the business logic
heavily depended on the local time), NTP synchronization and even an
interpreter bug were considered and then rejected. Finally, after two weeks
or so, I tried
to simulate the function with varying current time and fixed internal states,
and surprisingly a portion (but not all) of output from the past matched to
the observed output!

It turned out that glibc `localtime` _can_ misbehave by ignoring the local
timezone when it is unable to read `/etc/localtime`, and the Linux box the
server ran on had some issue reading that file (I never fully identified it;
this read was probably the only disk I/O from that server anyway). In light
of this finding I exhaustively inspected the past logs after the fact; the
gross error rate turned out to be on the order of 10^-4 (!), and the way
`localtime` was used meant that the error could only alter a portion of the
output. Studying the glibc code revealed that setting the `TZ` environment
variable would disable the UTC fallback, so I did, and the error was gone.
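
The workaround looks roughly like this (sketched in Python, whose `time` module wraps the same glibc machinery on Unix): pinning `TZ` makes the zone explicit instead of relying on `/etc/localtime` being readable.

```python
import os
import time

# Pin the timezone explicitly; in production this would be the box's
# real local zone rather than UTC. time.tzset() is Unix-only and makes
# the time module re-read TZ, just as glibc does.
os.environ["TZ"] = "UTC"
time.tzset()
epoch = time.localtime(0)   # now unambiguous, fallback or not
```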

Lesson: Learn your moving parts, even if you don't know them in advance.

------
ruirr
I had the luck to have had to solve some tough problems. Like a performance
analysis at the major telecom operator in an African country: finding many
flaws in the infrastructure, circular DNSes (do not ask), their Internet
reseller selling them double the real bandwidth, and, to top it all, bridging
enabled across the whole country because a vendor told them to "put this line
on the central router". Or when a colleague at a cable company wanted to
upgrade technologies, implementing filters at the CPEs after reading the
DOCSIS RFCs, and seeing the infrastructure's upload traffic dive to less than
half. Or taking over the Linux department and, even before reimplementing all
the servers, optimising servers that went from 9x% CPU utilisation to 10%. Or
(re)implementing the middleware for two cable Internet companies; one of them
had some functionality in Java that I reimplemented in C, and operations that
had taken 1h were done in 5 minutes.

------
sdrothrock
How to isolate and identify a human hand with the fingers spread and track it
in real time on an iPad.

Edit: Whoops, I realized this was ambiguous. I was using an iPad camera to
track it and displaying the result as well as using the detection to trigger a
camera shutter.

------
echeese
When I was a teenager, I was doing some experimentation with 2D shadows in
Flash. The first version used a BitmapData, iterating over every pixel and
lighting it if it was not obscured. This took ~15 seconds to compute.
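
That first per-pixel approach can be sketched like this (Python rather than ActionScript; all names are mine): a pixel is lit iff the segment from the light to it crosses no occluder, which is O(pixels x occluders) and explains the seconds-long runtime at screen resolution.

```python
def cross(o, a, b):
    """2D cross product of (a - o) and (b - o); sign gives orientation."""
    return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

def segments_intersect(p1, p2, p3, p4):
    """Proper (non-touching) segment intersection via orientation tests."""
    return (cross(p1, p2, p3) * cross(p1, p2, p4) < 0 and
            cross(p3, p4, p1) * cross(p3, p4, p2) < 0)

def is_lit(light, pixel, occluders):
    """The inner loop run once per pixel in the slow version."""
    return not any(segments_intersect(light, pixel, a, b)
                   for a, b in occluders)

light = (0.0, 0.0)
wall = [((2.0, -1.0), (2.0, 1.0))]   # a vertical wall at x = 2
```

The realtime versions avoid this loop entirely by drawing shadow polygons instead of testing pixels.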

I was happy with this, until a friend challenged me to make it realtime. I
managed to re-implement the same thing by using the built-in vector drawing
(and as a bonus, this also gave anti-aliasing) and managed to get this down to
15ms.

The third version used 3D acceleration and managed to get 100 lights to
render in realtime. I was pretty proud of myself and wrote an article about
it, which was cited a few times by different people.

------
andersthue
Shortly after I started my first consulting business back in 1998, one of our
customers wanted us to upgrade their Compaq server from one disk to a RAID.

We started Friday after normal working hour by checking that the backup worked
(it did) then proceeded to upgrade the server with a raid backplane and three
new scsi disk, installed Windows NT, installed the backup software and started
a restore while getting some takeaway.

The restore only took like 15 minutes - and to our horror we discovered that
the previous IT admin had set it up to do an incremental backup on the same
dat tape overwriting it every day!

OK, no worries, we had not used the old disk, so we installed it and turned on
the computer... Nothing happened... Strange. We removed the RAID backplane and
installed everything as it had been... Still nothing.

After 24+ hours working on the problem, including several hours talking to
Compaq support (best support ever!), we had to go home for some sleep. When I
got back to the server room I fired up Norton Disk Editor and painfully
figured out that the MBR was all zeros on the disk; luckily the rest of the
disk looked like correct data!

Several hours later, just before Sunday turned to Monday, I finally got an
MBR written using NDE and NDD, booted the system and saw everything was all
right.

Monday we told the customer we had some problems and would do the upgrade
another day (after we had taken multiple backups :)

------
kephra
I solved many challenging problems. E.g. I wrote a parser generator that can
generate itself; I wrote a UN/EDIFACT parser that parsed the human-readable
UN standard to create a parser for a semantic translation; my Y2K PTFs run on
every MVS and OS/360 system; and I did a lot of machine learning in the last
10 years, e.g. optimizing maintenance of Siemens power plant turbines or
quality control for injection molding machines.

But ... I'm taking on the biggest challenge right now: I'm coding my Onyx
database client idea for the 3rd time. The hardest problem was to start o3db
at all. I failed badly with Onyx 20 years ago, burning out while holding over
half a million lines of C++ in flow together with nearly 10k lines of my own
4GL, during the 3rd customer installation of Onyx. I was very shy of coding
UI/UX afterwards and escaped deep into server stuff and machine learning -
escaped as far away from the user as possible.

So, my biggest challenge was to start Onyx again: a user-facing UI/UX for
common business database applications with its own fourth-generation
language. I've decided on Scheme as an intermediate language this time, and
the prototype is running well. I now have a non-recursive Scheme interpreter
and a GUI running in the browser, able to process the meta tables defining an
application. It's still a long road to my vision. But to restart a project I
failed at with burnout 20 years ago, and to code it with current technology,
was the biggest personal challenge.

/join #o3db on freenode, if interested in a startup to create common business
database clients for the web.

------
sergiotapia
Performance tweaking when I used to build Blackberry apps back in 2009. It
sucked so much that it turned me off of Blackberry phones entirely, as a
consumer and as a developer. Remember these were the days where Blackberry was
-the- phone to use. BBPin was red-hot and the iphone was too expensive for 99%
of people.

Tweak -> Compile -> build deployable package -> push to phone -> wait 6
minutes -> test on phone -> repeat....

------
AnimalMuppet
Function a() called function b(). When function b() returned, a local variable
in function a() had changed from 0 to 1.

"Aha!" you say. "You're smashing the stack! Function b() is writing outside
its stack frame."

But function b() was provably not doing that.

Function b() called msgrcv(), which has a very badly designed API. It takes a
pointer to a structure, and a size parameter. The structure is supposed to be
a type field (long), and then a buffer (array of char). The size parameter is
supposed to be the size of the buffer, _not the size of the structure_. The
original code that implemented this came from a contractor, and they made the
very natural mistake of putting the size of the whole structure in the size
field. This meant that an extra long was read from the message queue, and
smashed the stack.

But that should mess up the stack frame of function b(). How did it mess up a
variable in function a()? Well, the compiler put that variable in a register,
not on the stack. So when b() was called, it had to save off the registers it
was going to use, so a()'s local variable wound up in b()'s stack frame.

It took me most of a month, off and on, to figure that out.

------
zallen
Answering this question feels like the hardest problem I've solved yet... ;)
Because, I don't know: I've never really thought "this one! THIS is the
hardest!" You just iterate and things get more and more challenging as you
build skills. What seemed hard to a junior tech doesn't seem hard to me as a
senior tech now. It's all just engineering. It is all just sitting down,
reading manuals or prior art, getting familiar with protocols or fundamentals,
and building maps in your head until you understand something. Then building
proofs of concept and outlines; then applying a bunch of troubleshooting
principles; repeat until the problem is solved. I've written academic papers this
way, I've built streaming servers off esoteric industrial process-control
database APIs, I've done process visualizations, I put a model railway online
(before that was an out of the box thing)... and it's all the same: use what
other people did, understand it, and then build from there.

------
thaumaturgy
I have a few little trophies I go back to every once in a while when I'm
feeling like a crappy programmer.

\- I worked out, on pen and paper, sorting networks on my own a few years
before the Wikipedia article on them existed. I was looking for shortcuts in a
Quicksort implementation. I hadn't read _Art of Computer Programming_ yet,
which is probably the only other place I would've been likely to read about
it. It hadn't been covered in any of the other programming literature that I
was devouring at the time.
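
For anyone who hasn't met them: a sorting network is a fixed, data-independent sequence of compare-exchange steps. Here's a sketch of a standard 5-comparator network for 4 inputs (not the one worked out on paper), verified with the zero-one principle:

```python
# A known minimal 5-comparator sorting network for 4 inputs.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

def network_sort(values, network=NETWORK_4):
    """Apply each compare-exchange in order; the sequence is fixed
    regardless of the data, which is what makes networks parallelizable."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v

# Zero-one principle: a network that sorts all 2^n binary inputs sorts
# every input, so 16 checks prove correctness for n = 4.
ok = all(network_sort([(m >> k) & 1 for k in range(4)])
         == sorted((m >> k) & 1 for k in range(4))
         for m in range(16))
```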

\- I wrote a variable interpolator in COBOL. COBOL has no string operators or
anything resembling a string data type. This one was tricky. I was working as
a programmer/operator at a school district at the time and the central hub of
their IT was a Unisys mainframe that ran COBOL and WFL. There weren't any
punch cards anymore, but everything ran as if there were; for any given job to
run, say, report cards, you had to go into the WFL job and edit a two-digit
school code in half a dozen places, in "digital punch cards", which would then
be fed one after the other into COBOL programs. This was error-prone and I
wanted a way to define a couple of variables at the top of the job file and
then have everything work after that.

\- I worked for a BigCo that used Remedy for its internal support systems.
There were some latent training issues in the internal support department and
support requests kept getting modified by unknown people, which would cause
the requests to get mishandled and would irritate various other departments. I
found a way to sneak some code into the Remedy forms system and I cobbled
together a very rudimentary communications protocol between several forms so
that all changes to any form got logged to another form, along with the user's
id. Remedy had no loop logic at the time. That actually made it to a Remedy
developer's group mailing list once and I was a big fish in a very tiny little
puddle for a day.

\- I reverse-engineered portions of the .dbf format that FoxPro uses, and
wrote software that could convert .dbf files into MySQL tables. The date
format was tricky. It was an 8 byte field where the first four bytes were a
little-endian integer of the Julian date (so Oct. 15, 1582 = 2299161), and the
next four bytes were the little-endian milliseconds since midnight. This is
not documented anywhere.
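
Based on that description, the decoding is only a few lines (a sketch; `decode_foxpro_datetime` is my name for it). JDN 2299161 being 1582-10-15 fixes the offset to Python's proleptic-Gregorian ordinal:

```python
import struct
from datetime import date, timedelta

# date.toordinal() of 1582-10-15 is 577736, and its JDN is 2299161,
# so the constant offset between the two systems is 1721425.
JDN_TO_ORDINAL = 1721425

def decode_foxpro_datetime(raw):
    """Decode the 8-byte field: little-endian Julian day number,
    then little-endian milliseconds since midnight."""
    jdn, ms = struct.unpack("<II", raw)
    return date.fromordinal(jdn - JDN_TO_ORDINAL), timedelta(milliseconds=ms)

# Noon on the first day of the Gregorian calendar:
raw = struct.pack("<II", 2299161, 43_200_000)
d, t = decode_foxpro_datetime(raw)
```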

Those are some of my favorites anyway. 30 years of programming, there's been
some fun stuff along the way.

------
rvalue
Writing an implementation of the parallel Travelling Salesman Problem w/ B&B
using MPI and getting some god damn speedup.
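
For context, a serial sketch of the branch-and-bound core (the MPI distribution of subtrees is the hard part being referred to; all names are mine):

```python
from math import inf

def tsp_branch_and_bound(D):
    """Depth-first search over partial tours, pruning any branch whose
    cost already meets the best complete tour found so far."""
    n = len(D)
    best = [inf]

    def extend(path, cost, remaining):
        if cost >= best[0]:
            return                  # bound: this branch cannot win
        if not remaining:
            best[0] = min(best[0], cost + D[path[-1]][path[0]])  # close tour
            return
        for city in remaining:
            extend(path + [city], cost + D[path[-1]][city],
                   remaining - {city})

    extend([0], 0, frozenset(range(1, n)))
    return best[0]

# Four cities on a cycle; the optimal tour is the perimeter, cost 4.
D = [[0, 1, 2, 1],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]
```

Parallelizing means handing disjoint subtrees to MPI ranks while broadcasting the shared best bound, which is where the real speedup fight happens.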

~~~
Raed667
Is this open-sourced?

------
ex3ndr
Making android lists scroll smooth

------
danudey
Setting up an IPSec VPN from a Linux server to Amazon VPC and running data
over it. There was a host of documentation on how to do similar things with
the appropriate tools, but as always, it was document A with 40% of the
puzzle, document B with a non-overlapping 30% of the puzzle, and document C
with an overlapping 40% of the puzzle… at which point I realized that all
three documents were using different approaches/conventions/etc.

Documentation for the tools available seemed to varyingly assume that you
either a) understood IPSec well enough and only needed to know how to use this
one tool, or b) knew everything you needed to know, minus a few hints on the
syntax of individual files.

Eventually I got everything working, but performance was abysmal. Sometimes.
Sometimes SSH sessions opened instantly. Sometimes they opened slowly but then
worked fine afterwards. Some tools were awful and others worked okay.

Eventually I realized that the IPSec configuration set up two tunnels to
Amazon, but only set up actual routing (defining endpoints) for one of them.
Thus Amazon was load-balancing packets over both tunnels and my Linux
implementation was dropping 50% of packets. For established TCP connections
this was fine: we had basically zero latency to VPC, so retransmits (for what
we were doing) were almost free, since the loss would be discovered as soon as
the next packet arrived successfully. But for SYN/ACK packets a drop meant an
annoying wait.

Unfortunately, the tools don't allow you to define redundant/overlapping
routes, so I couldn't set up two tunnels; I had to just configure one tunnel
and leave the other one down so AWS wouldn't try to send data over it, and
then just hope that that endpoint didn't go down at an inopportune time before
I'd either set up some kind of load balancing scenario on my internal network
(internal BGP maybe? ugh!) or given up entirely on the project.

After weeks of working on this specific task (the VPN setup), making literally
zero progress some days, googling for literal hours with no useful results,
and trying various permutations, when I got it working I felt like I was the
only person on the planet who'd ever done this before; I was pretty sure no
one on the internet had ever written about it, at least.

Even though the project was ultimately scrapped, I still feel like I learned a
lot. Maybe I should feel like it was wasted time, but it also felt like quite
an achievement to succeed.

~~~
JensRantil
This is funny because I am basically about to try and diagnose a _very_
similar issue with a VPN tunnel between a Cisco ASA and AWS. I'm also seeing
SYN/ACK being occasionally dropped and TCP connection states ending up in WAIT
state.

------
studentrob
Mine is more a social solution than a tech one. Hope that counts here!

Years ago I came up with a simple equation for determining the priority of
software engineering bug fixes and small features:

Priority = (Benefit the feature provides to the product) / (Time to complete
the feature)

where benefit is defined by the business side using any scale (say 1-100), and
time to complete is estimated by the assigned software engineer using any unit
(say person-hours). Regardless of what range the numbers fall in, 0 to 1 or 0
to 42, you end up with an ordered list of tasks that weighs business value
against engineering time.
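
A minimal sketch of the equation in practice (task names and numbers invented): sort the backlog by benefit per unit of time, highest first.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    benefit: float  # business-side score, any scale
    hours: float    # engineer's estimate, any unit

def prioritize(tasks):
    """Order tasks by Priority = benefit / time, highest first."""
    return sorted(tasks, key=lambda t: t.benefit / t.hours, reverse=True)

backlog = [Task("new report page", 90, 40),
           Task("fix login bug", 80, 2),
           Task("tweak email copy", 10, 1)]
print([t.name for t in prioritize(backlog)])
# -> ['fix login bug', 'tweak email copy', 'new report page']
```

Note that the scales cancel out: only the ratio's ordering matters, which is why the business side and engineering can each use whatever units they like.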

I came up with this while working at a medium sized company. I was frequently
tasked with too many things to do. Despite tasks being organized in a Redmine-
like tool, the implementation was still done in random order because nobody
could define priority. This led to much miscommunication about what I was
working on in the recent past and future. I used the equation to better
communicate my activity and future plans with the business side. Given an
ordered list of tasks from this equation, anyone could see clearly what was
being worked on next.

The business side resisted attaching a numeric benefit to the features,
presumably because that's hard. But it's equally hard to define the time to
complete a software engineering task, and I eventually convinced them we
needed to at least try to be scientific about both.

n.b.: I used this while working on a mature system. For a newer project or for
tasks with more dependencies, it's probably still complicated to define
priority. In the setting I was in, it worked great.

My boss's boss, however, thought it was condescending, and nobody aside from
me ever made use of it. I hope to make use of it again one day, but after
one bad experience with a medium sized company, I've stuck with smaller places
where this is not as necessary.

------
stephenr
While I've had some weird technical problems (I've worn a number of hats
across Network/System Admin, to both front and back end web Development) the
hardest is always the non-technical issues.

A few years ago I was contracting for a company that had a Native American
Casino as a client. They wanted to build a gamified app/site to engage their
customers more.

The single hardest problem was trying to look at the situation from the
player's point of view. Gambling like this (slot machines) is inherently an
illogical thing to do - they _know_ they're never going to make back the money
they put in, but they walk away with a smile night after night.

Trying to rationalise it (so we could understand their goals and what they
might want out of an app/site targeted at them as players) proved impossible
for basically everyone on the team.

~~~
jamesdelaneyie
How did it pan out in the end? Did you talk to the people the app was targeted
at?

~~~
stephenr
They ran a small user session with some 'rewards club' players, to get
feedback on the MVP that we built.

It did go live eventually but I don't think it's taken off as they hoped.

------
someremains
Hard problems are great because once you solve them, you get to solve even
more challenging things as a result. I (with a small, great group of people)
build a lot of physical things that are meant to look deceptively simple,
mostly by making every artefact of support disappear. This leads to a lot of
great design, engineering, documentation, and procurement-logistics
challenges. Past: wrapping a building in custom-made chain mail and needing it
to a) fit like a glove, b) not fall off, and c) not cause us to go broke.
Current: 18,000 m2 (196k SF) of entirely custom, double-curved aluminium
panels. The unique part count is currently hovering around 1,000,000 distinct
components that all need to end up on a piece of paper (the building industry
is big, slow, and strange).

------
SugarfreeSA
I think that this is such a tough question to definitively answer because the
difficulty of a problem is relative to a particular point in time.

I am currently working on my thesis in artificial intelligence which to me
seems tough because I have never written a thesis before. However, at work, I
am dealing with technical software engineering problems that will seem easy
after I have solved them.

My first industry project involved creating a generic form builder which could
ultimately be used as a survey tool to draw statistics from. This seemed
extremely challenging at the time, but now that all of the design decisions
have been made and the complexities solved, I could redo it pretty easily
(even though we shouldn't reinvent the wheel).

Good thought provoking question though! Thanks!

------
dvirsky
I was part of a team that designed and implemented a P2P UDP based video
streaming protocol, that receives chunks of a stream simultaneously from up to
hundreds of peers. It wasn't a "big problem" per se - I worked on seemingly
harder problems before and since - but this one was really hard to get right:
it was a very non-deterministic beast that was extremely hard to test. I
remember many times when I secretly felt we would never get this thing working
as well as we wanted, but in the end we did.

I left this company long ago but they appear to be going strong still.
[http://www.giraffic.com/](http://www.giraffic.com/) . I'm sure they improved
on that work a lot since then.

~~~
phpnode
Out of interest, did you use skiplists to reassemble the streams in order?

~~~
dvirsky
Wow, it was so long ago, I don't remember the exact data structures, but I'm
pretty sure it wasn't a skiplist.

Since you've asked: one cool part of that technology is that the order of
received packets doesn't matter for assembling the stream. Basically, every
one-second chunk of video is reassembled regardless of the order in which its
packets arrived. You need N packets to assemble a "data frame"; IIRC pending
incomplete data frames were stored in a simple hash table, but honestly it was
so long ago I don't remember.
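
The hash-table buffering described here might look roughly like this (packet layout and the value of N are invented for illustration): packets for a one-second "data frame" are buffered in whatever order they arrive, and the frame assembles once all N chunks are present.

```python
N = 4  # chunks per data frame (assumed)
pending = {}  # frame_id -> {chunk_index: payload}

def on_packet(frame_id, chunk_index, payload):
    """Buffer one packet; return the assembled frame once N chunks exist."""
    chunks = pending.setdefault(frame_id, {})
    chunks[chunk_index] = payload
    if len(chunks) == N:  # complete, regardless of arrival order
        del pending[frame_id]
        return b"".join(chunks[i] for i in sorted(chunks))
    return None

# Chunks arrive out of order; the frame still assembles correctly.
for idx in (2, 0, 3, 1):
    frame = on_packet(7, idx, bytes([idx]))
print(frame)  # -> b'\x00\x01\x02\x03'
```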

------
kohanz
We had a dependency limitation where an SDK we relied on only had an x86
release while our software ran (necessarily) as x64. I was quite proud of
myself when I wrote (in a relatively short period of time) an IPC-based
(memory mapped files) solution to communicate between the two seamlessly
(performance mattered, as we were doing real-time imaging). It felt like a
problem that some of my co-workers would have just given up on and said "it's
not possible". Might not have been the "toughest", but in terms of
time/difficulty trade-off, it ranks up there. Perhaps it would be trivial for
others.
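
The pattern can be sketched in miniature with Python standing in for the real C++/Windows code, and one process playing both roles. This is only the data path; the real setup would also need synchronization (events or semaphores) between the 32-bit helper wrapping the x86-only SDK and the 64-bit host.

```python
import mmap, os, struct, tempfile

path = os.path.join(tempfile.mkdtemp(), "ipc.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)  # fixed-size shared region

def write_frame(data: bytes):
    """'x86 helper' side: publish one length-prefixed frame."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 4096) as m:
        m[0:4] = struct.pack("<I", len(data))
        m[4:4 + len(data)] = data

def read_frame() -> bytes:
    """'x64 host' side: read the latest frame."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 4096) as m:
        n, = struct.unpack("<I", m[0:4])
        return bytes(m[4:4 + n])

write_frame(b"pixel row 0")
print(read_frame())  # -> b'pixel row 0'
```

For real-time imaging the win of this approach is that frame payloads never cross a socket; both processes read and write the same pages.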

Of course, the real solution would be to press the dependency provider to
release an x64 version, but we were not a priority of theirs.

------
chubot
All the problems I've solved seem about equally hard, since I put my full
effort into solving them :) I tend to go into new subfields where I don't have
a background, so the hardness is probably just proportional to the length of
time I spent on the project.

The problems I failed to solve are the ones that seem the hardest, of course.
I tried to write a cluster manager / distributed OS by myself,
starting almost from scratch, and that was too much. I spent upwards of 4
years on it, and had some success, but I'm starting to move on.

In particular, I learned that having a reasonable amount of security with
reasonable amount of development effort in a distributed system is still an
unsolved problem. It's basically a bottomless pit of work.

------
Libbum
Difficult to say this problem is solved yet - the jury is still out, but I've
done a good deal of work on identifying what the mechanism of a defect in
superconducting phase qubits may be.

TL;DR: Two level system defects are a 20 year old unidentified noise source
that can be described by an oxygen spatially delocalising in an amorphous
portion of the underlying circuit.

See
[http://dx.doi.org/10.1103/PhysRevLett.110.077002](http://dx.doi.org/10.1103/PhysRevLett.110.077002)
and
[http://dx.doi.org/10.1088/1367-2630/17/2/023017](http://dx.doi.org/10.1088/1367-2630/17/2/023017).

------
evincarofautumn
Designing a usable static type system for Forth-like (concatenative)
programming languages.

------
karterk
Implementing all-pairs similarity search on a few hundred million records.
Naively approached, the complexity is O(N^2), so I had to come up with novel
ways to make it finish in a reasonable amount of time with limited resources.
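
One standard way to dodge the O(N^2) comparisons (a sketch, not necessarily what the author did): MinHash each record and bucket records by signature bands, so only records sharing a bucket need to be compared exactly.

```python
import hashlib
from itertools import combinations

NUM_HASHES, BANDS = 8, 4

def minhash(tokens):
    """MinHash signature: per hash function, the minimum token hash."""
    return tuple(min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
                     for t in tokens)
                 for i in range(NUM_HASHES))

def candidate_pairs(records):
    """Pairs of record ids that share at least one signature band."""
    rows = NUM_HASHES // BANDS
    buckets = {}
    for rid, tokens in records.items():
        sig = minhash(tokens)
        for b in range(BANDS):
            buckets.setdefault((b, sig[b * rows:(b + 1) * rows]), []).append(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs  # verify similarity exactly only for these pairs

records = {"a": {"red", "shoe", "nike"},
           "b": {"red", "shoe", "adidas"},
           "c": {"laptop", "charger"}}
print(candidate_pairs(records))
```

Similar records tend to collide in at least one band, so the expensive exact comparison runs on a tiny fraction of the N^2 pairs.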

~~~
DanielRapp
I'm sitting with a similar problem right now! Got any pointers?

------
icpmacdo
I have been programming for a bit more than three years. Making a half-decent
app in Cordova for some classes was the hardest I've ever worked at solving
problems in code. Looking back on it now, the code is really, really bad.

~~~
mataug
Wow, is Cordova that bad?

~~~
abhinai
Cordova is not bad; trying to program a mobile app in HTML5 is. (1) You have
to build most of the UX interactions yourself. (2) Performance is a bitch: you
can spend months trying to optimize your code and it still sucks. (3)
Different versions of Android have different levels of support for the HTML5
API; in the end, you get to use the lowest common denominator. (4) Windows
Phone reloads your JavaScript/HTML code every time someone starts the app,
giving an obvious "reload flicker".

Basically it is one of those unfortunate cases where the first weeks make
everything look really promising (single codebase and all), and it is only
after several months of hard work that you realize there is no way you are
going to win this battle.

~~~
timrichard
Hi, just wondering if you've looked at Intel Crosswalk? It's intended to help
with issue 3.

------
chazu
This is a pathetic answer, but for me it's probably an NLP web endpoint I
built. The task was to take a query and categorize it into one of several
categories provided when the server was started. So, for example, if a user
submitted the query "barbie doll", the endpoint would return the three most
likely categories: "toys", "clothes", and, let's say, "office supplies".

The way I did this was by using NLTK to compare the hypernym paths of the
words in the query against the hypernym paths of the category names. I wrapped
it in a tiny flask app and it was surprisingly fast enough for an MVP.
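
The idea can be shown with a toy hand-built taxonomy standing in for WordNet (the real version walked NLTK's hypernym paths; this taxonomy fragment and these words are invented): score each category by how much of its hypernym chain overlaps the query word's chain.

```python
TAXONOMY = {  # child -> parent (hypernym); a tiny invented fragment
    "barbie": "doll", "doll": "toy", "toy": "artifact",
    "shirt": "clothing", "clothing": "artifact",
    "stapler": "office_supply", "office_supply": "artifact",
    "artifact": None,
}

def hypernym_path(word):
    """Walk the chain from a word up to the taxonomy root."""
    path = []
    while word is not None:
        path.append(word)
        word = TAXONOMY.get(word)
    return path

def rank_categories(query, categories, top=3):
    """Categories whose hypernym chains overlap the query's chain most."""
    qpath = set(hypernym_path(query))
    return sorted(categories, reverse=True,
                  key=lambda c: len(qpath & set(hypernym_path(c))))[:top]

print(rank_categories("barbie", ["clothing", "toy", "office_supply"]))
# "toy" ranks first: it lies directly on barbie's hypernym chain.
```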

~~~
tmaly
what became of this project? It sounds interesting.

------
RobBollons
I usually find architecture problems to be the hardest to solve. The hardest
one I've had to deal with is taking a legacy web application ~3 million lines
of code and giving it some form of architecture so the product can have a
sustainable future. Some of the issues included inline CSS styles, core logic
written in linear Classic ASP, ASP Web Forms written in a linear fashion, and
so on. As you can guess, what made it hard was solving these issues without
breaking anything; it's an obvious example of why automated testing and code
quality are so important.

------
groar
Clearly, when I think about the hardest thing I ever coded, I have the
following story in mind.

Back in 2002 I was writing a floppy disk driver for the little OS we were
writing with a friend. It turned out that finding anything other than very
sparse documentation was really hard; plus, for some unknown reason, the
floppy drive's behavior seemed to be non-deterministic. Maybe the fact that I
was 15 didn't help.

At some point, after many nights spent debugging it, it just worked. I still
don't know why. I never changed a line of the code after that moment, for fear
of breaking it.

------
lordnacho
Hard to pin down one in particular:

- Anything where you're looking for a race condition. It tends to be hard to
reproduce, and instrumentation can make it go away entirely, leaving you
needing to conjecture about what might be happening. Quite satisfying when you
find it, but again, because it's rare, you never know if you've really solved
it.

- Built a cross-platform, cross-language messaging system for trading.
Combined UDP and TCP, had detection of downed servers. A lot of fiddling with
network stuff, performance optimization on all platforms, both VM and native.

------
FarhadG
Not the most difficult but one of the most interesting in my professional
software career: I implemented a consistent eventing model between WebGL and
DOM. As a contributor to Famous' 3D engine, I wanted to have a similar
eventing system between WebGL and DOM elements (element.on('click'),
element.on('scroll'), etc.). I decided to use a "picking" model: encode
geometry IDs in base 255 (4+ billion IDs) into the color buffer and provide a
consistent API for both the DOM and WebGL renderers.
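
The ID encoding can be sketched like this (Python shown for clarity; the real version ran on the GPU): pack an integer into four color channels in base 255, giving 255**4 (about 4.2 billion) distinct pickable IDs. The renderer draws each object in its ID color into an offscreen buffer and reads back the pixel under the cursor to find what was clicked.

```python
def id_to_rgba(obj_id: int):
    """Encode an object ID as four base-255 color channel digits."""
    assert 0 <= obj_id < 255 ** 4
    r, g, b, a = (obj_id // 255 ** k % 255 for k in range(4))
    return (r, g, b, a)

def rgba_to_id(rgba):
    """Decode the picked pixel's color back into the object ID."""
    r, g, b, a = rgba
    return r + 255 * (g + 255 * (b + 255 * a))

assert rgba_to_id(id_to_rgba(4_000_000_000)) == 4_000_000_000
```

Using base 255 rather than 256 leaves the value 255 free in every channel, which is handy as a "no object here" sentinel.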

------
danialtz
Turning 21 TB of images into 12 one-page Excel sheets that informed the choice
between a few cancer drugs. The coding was the easier part; the biological
data reduction was harder.

------
stevoski
Making a website with functioning log-in and log-off.

This was the '90s. It was surprisingly hard to implement this in a workable,
reliable, secure way, and no one in our company of 50 programmers had ever
done such a thing before!

I recall being puzzled for way too long at how to prevent someone from coming
to a browser that had just logged off from our web app, and clicking the back
button a couple of times to be logged in again.

Now, of course, it is a common and easy-ish task.
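
The usual modern fix for the back-button problem (a sketch, not necessarily how they solved it then) is to forbid caching of authenticated pages, so Back triggers a fresh request that the server can redirect to the login page once the session is gone. Handler names here are hypothetical.

```python
# Headers that tell the browser not to serve this page from cache/history:
NO_STORE = {
    "Cache-Control": "no-store, no-cache, must-revalidate",
    "Pragma": "no-cache",  # for old HTTP/1.0 caches
    "Expires": "0",
}

def account_page(session_valid: bool):
    """Hypothetical handler returning (status, headers, body)."""
    if not session_valid:
        return 302, {"Location": "/login"}, b""
    return 200, dict(NO_STORE), b"<h1>Your account</h1>"

print(account_page(False)[0])  # -> 302
```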

------
randomsearch
A simulator gave different results depending on the length of the filename of
the executable under test. Only when I ran it on two separate but identical
machines with different hostnames (I used the hostname in the file path to
keep a record of where things were run) did I discover the cause of the
differences I was seeing: the way syscalls were handled by the simulator.

------
declan
If you're talking only about coding (and not other life challenges), the
hardest problem I've solved so far has been figuring out how to build
[https://recent.io/](https://recent.io/) with my co-founder. Recommendation
engine, fetchers, iOS app, Android app, etc.

There's a very big difference between concept and working code. :)

------
bliti
In code: Naming things in a manner that makes the code readable. It's a
constant challenge.

With cars: troubleshooting and fixing a Ferrari 599 without the required
factory diagnostic computer. You can't beat a multimeter and some elbow
grease. It was a faulty flow meter.

In general: figuring out what to do with my life. It took me a bit but was
worth the time. Now I can focus on doing that and just that.

------
luck87
A GPU-optimized brute-forcer for TrueCrypt volumes:
[https://github.com/lvaccaro/truecrack](https://github.com/lvaccaro/truecrack)
. I extracted the cipher algorithms and built a parallel version of them in
CUDA. When I started the project in 2011, the GPU world was not as friendly as
it is now.

------
toxicFork
Making a trainer for a game. For example: "infinite bullets". It would crash
the game for weird reasons. I ended up patching many places of the executable
to prevent the crash.

In the end I found out that I managed to write a crack for the game by
accident :D Later on I inspected a crack from another team, it would patch the
same regions!

------
krapp
Getting SDL's incomplete types to work inside of a std::vector of
std::unique_ptrs and compile with Visual Studio's compiler, and building a
basic (rudimentary, probably not awesome) entity-component system in C++ to
work with them.

And yes, I know a lot of what I just typed will probably put real game
programmers' teeth on edge.

------
zw123456
About 10 years ago I built a Scanning Tunneling Microscope. It took me 2 years
to complete it and to get it working.

~~~
ollyfg
Wow, that sounds really cool! What parts did you use (how much was "from
scratch")? Any blog post or article about this? I'd love to read more.

------
RogerL
Robustly tracking and localizing very small objects in high clutter
environments using computer vision.

------
rshetty
Got WebSockets working with a reverse proxy (IIS) sitting between the browser
and a Golang app server.

------
neurotech1
I used to repair EEG systems and we had about a dozen "noisy" units. The
boards all tested fine, but were noisy when hooked up to measure brainwaves.

It turned out people were repeatedly running the SLA battery completely flat,
subtly "wearing out" the battery.
------
azeirah
Building our own OpenGL game engine in C++ to run on a Banana Pi, which has
severely broken drivers. It was both non-fun and very fun. We could've just
chosen to do it on an FPGA, but nooooo, we had to pick an actual GPU ;__;

------
racl101
Setting up a web service that can transcode audio files (namely mp3) that are
uploaded to a server using ffmpeg. In fact, I know I could still improve it
but I haven't figured out all the things ffmpeg can do.
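
The transcoding step itself can be as simple as shelling out to ffmpeg; the flags below are common real ones, but the exact options depend on the target format and quality, so treat this as a sketch.

```python
import subprocess

def mp3_command(src: str, dst: str, bitrate: str = "192k"):
    """Build the ffmpeg command line for an MP3 transcode."""
    return ["ffmpeg", "-y", "-i", src,
            "-codec:a", "libmp3lame", "-b:a", bitrate, dst]

def transcode(src: str, dst: str):
    # Raises CalledProcessError if ffmpeg reports a failure.
    subprocess.run(mp3_command(src, dst), check=True, capture_output=True)
```

Splitting command construction from execution makes the web-service side easy to test without ffmpeg installed.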

------
avmich
I liked experience writing Tomita LR parser generator in J - while learning
both parsing and J :) . ~700 lines heavily commented code with tests... Of
course now it doesn't seem all that hard.

------
imh
Finding the Green's function for a nasty differential-difference equation.
Great stuff, but I feel sooo embarrassed thinking back on how I attacked this
kind of problem back then.

------
golergka
Modifying a Unity game with an in-house 2D framework to correctly process
complex Unicode strings and input, and to render LTR, far-eastern, and emoji
characters.

------
brianwillis
This might not be the hardest bug I've tackled, but it certainly took the
longest to solve.

It was the early days of SOAP, and I had been assigned the task of integrating
my employer's software with a third party's, so that the applications could
share data. This third party org was a wealthy, powerful mega-corporation; and
my employer was, well, not. The third party produced a spec for the interface,
expected us to follow it, and offered no help from there.

I built a solution. It worked on my machine. Solved the problem. All was right
in the world.

I moved it to the test environment. It worked again. Demoed it for one of our
customers, and everyone was pleased.

Deployed it to our first beta tester. One lonely employee working accounts
receivable, tucked away in the corner of our customer's office.

It crashed.

I checked everything. I mean everything. There are still particulars of that
little Windows 2000 workstation that I can describe vividly. Which programs
were installed, which patches were installed, how Windows had been configured,
how the firewall worked, I even got permission to install a packet analyzer.
My employer only had a handful of customers, and the beta test machine was
near our offices, so I was over there personally a lot over the following
weeks.

We brought in the customer's network support people. They found nothing. They
could see the packets leaving, and an error coming back, but couldn't offer
more than that.

We brought in the best networking engineer in my company. He was stumped.

What really shook my confidence was knowing that competitors of mine had
gotten this interface working. This wasn't some half-baked project that I
could blame on someone else. Others had succeeded where I'd failed.

I practically had to walk across broken glass to get on the phone with the
third party's development team, but with enough pestering I pulled it off.

The phone call involved me sitting at the beta test workstation and firing off
a request so that they could view it hitting their servers live. The developer
who I spoke with immediately spotted the problem.

You see, when you send a SOAP request, you send the date and time that you're
making the request along with it. The clocks on the client and the server were
too far out of sync, my requests appeared to be coming from the future, and so
the server disregarded them with a blunt error. Interestingly, the workstation
clocks at my company's office weren't too far out of sync, which is why it
worked in one place and not another.
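
The server-side check that bit them, in miniature: WS-Security-style services reject requests whose timestamp falls outside a freshness window around the server's clock. The five-minute tolerance here is an assumption for illustration.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=5)  # assumed tolerance

def accept(request_time: datetime, server_now: datetime) -> bool:
    """Reject requests from 'the future' (or too far in the past)."""
    return abs(server_now - request_time) <= MAX_SKEW

now = datetime(2016, 1, 1, 12, 0, tzinfo=timezone.utc)
print(accept(now + timedelta(minutes=1), now))   # close enough: True
print(accept(now + timedelta(minutes=10), now))  # "from the future": False
```

A client whose clock is 10 minutes fast fails every request, while one a minute off works fine, which is exactly why it worked in one office and not the other.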

Stuff I learnt:

1. Third-party interfaces require a point of contact at both organisations who
can talk with one another. This is non-negotiable.

2. If you send an error message that reads "Error", you're a bad developer and
should return your computer science degree to your university and demand a
refund.

3. No matter how well written the spec is, something always gets left out.

4. Persistence matters more than anything.

------
fbomb
I found a rather elegant solution for the halting problem. I would share it
here but it won't fit in the margins.

------
codezero
As an undergrad, I worked in a lab that had a satellite all-sky imager.

It had three CCD cameras with strip imagers that were combined into a single
all-sky image every orbit.

I was given a FORTRAN codebase that dated back to the 70s (supporting
functions) and was told to figure out the best way to pick the start and end
of the orbit as far as image frames were concerned.

The pointing data was in satellite frame-of-reference quaternions [1], and the
satellite orbited about the axis of the Sun-Earth line, approximately.

Approximately was the key. Since it wasn't at a perfect 90 degree angle, the
CCD strips each crossed over the plane defined by the Sun-Earth line and the
axis orthogonal to the Earth's orbit (I referred to it as "south") at an
angle.

So, if you want to stitch together an image of the sky that looks continuous,
but the orbit of the imager wobbles a bit, and different discontinuities show
up every day, how do you do it?

The leading CCD could be entirely across the southern line when the other CCDs
were just starting to cross it. This created a lot of problems with how you
define a complete orbit that lacks discontinuities and makes intuitive sense
so others can understand the code.

I decided to pick the point where the middle of the central camera crossed the
plane as the frame of reference for the start/end point.
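
A toy version of that start-of-orbit test (vectors invented for illustration): the orbit boundary is the first frame where the camera's boresight crosses the reference plane, i.e. where its dot product with the plane's normal changes sign.

```python
import math

def crossing_index(boresights, plane_normal):
    """Index of the first frame whose boresight crosses the plane."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    signs = [dot(b, plane_normal) for b in boresights]
    for i in range(1, len(signs)):
        if signs[i - 1] < 0 <= signs[i]:  # sign change: plane crossed
            return i
    return None

# A boresight sweeping through the plane x = 0 (normal along +x).
frames = [(math.cos(a), math.sin(a), 0.0) for a in (2.0, 1.8, 1.6, 1.5, 1.4)]
print(crossing_index(frames, (1.0, 0.0, 0.0)))  # -> 3
```

The real problem was harder because each CCD strip crossed the plane at a different time, which is why a single reference camera had to be chosen.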

Ultimately, this project took me about three months, just to get used to the
code base, the spatial coordinates and transformations needed to make sense of
the data, and then to finally write the code.

The meaningful changes I made in the commit consisted of about three lines of
code.

I found the commit message:

Fixed problem near seam of map where start and end of orbit meet. The
orientation of camera 2 at the start of the orbit is now used to draw a
reference great circle on the sky. Near this boundary pixels are tested
individually to decide whether they are part of the current orbit and should
be dropped in the skymap. Introduced torigin to keep track of the time origin
for the lowres time map. This is added to the Fits header of the time map as
keyword TORIGIN (used to be STIME). Times tfirstfrm and tlastfrm are assigned
the time of the first and last frame, respectively, for which at least one
pixel was dropped in the skymap. These are written into the main header of the
skymap as keywords STIME and ETIME. Added extra extension to lowres maps
containg nr of pixels contributing to each lowres bin

[1]
[https://en.wikipedia.org/wiki/Quaternions_and_spatial_rotati...](https://en.wikipedia.org/wiki/Quaternions_and_spatial_rotation)

------
hitlin37
There are a lot of hard bugs in embedded Linux development, but once solved,
they're not hard anymore :) For example, porting code to a different
architecture can be tricky in places. Porting a device driver to your new chip
can be tricky as well, especially if your HW vendor isn't helpful.

------
CmonDev
To sum up: dealing with other people's non-open-source code. Various COM APIs
for example.

------
recursive
Uploading my perl cgi scripts as text instead of binary. I solved it in only 4
hours.

------
pathintegral
Well, the obvious answer would be "my wife". Except she probably disagrees.

------
pvaldes
The 'hello world problem' a.k.a. birth. Nothing compares with that.

------
interdrift
A special graph dissection algorithm. It took me 3 months. >.<

------
Agentlien
Implementing a real-time GPU-based fluid simulation.

------
JoshTriplett
It depends on which kind of "hard" you mean: some problems are straightforward
but very involved to fix, while others are incredibly difficult to investigate
but easy to fix once found.

My first significant contribution to FOSS was to port OpenOffice.org to work
without the then-proprietary Java, so that it could go into Debian main (and
other distros with similar requirements). At the time, OO.o took 8 hours to
build, or 3 hours with the wonders of ccache, and I was hacking on the build
system itself, so incremental builds were often broken. (And the first thing
OO.o built was its own implementation of make.) So over the course of a month
or so, I would hack on it, rebuild to see it get a bit further, and repeat
until it finally built without error. The net result was dozens of patches
submitted and merged into Debian and ooo-build, and the 1.1.0-2 changelog
entry listed here, which made it all worth it: [http://metadata.ftp-
master.debian.org/changelogs/main/libr/l...](http://metadata.ftp-
master.debian.org/changelogs/main/libr/libreoffice/libreoffice_4.3.3-2+deb8u1_changelog)
('The "Wohoo-we-are-going-to-main" release')

The most _challenging_ problems were two different mysterious crashes in BITS
(biosbits.org), a Python environment running at the firmware level. Because of
the environment, a crash means a sudden unexplained reboot, with no diagnostic
information.

First, I was trying to debug a crash in the initial CPU bringup code, which
brought the CPU from 16-bit real mode to 32-bit mode. After extensive
investigation, including assembly output of characters to the serial port to
indicate how far the code got, and hand-comparison of disassembled code with
the original, it finally turned out to be a bug in the GNU assembler, mis-
assembling an expression with a forward-referenced symbol when in
.intel_syntax mode. The forward reference ended up becoming an unresolved
relocation (with a 0 placeholder) instead of the intended compile-time
constant, resulting in a wild pointer. It was one of the rare instances where
the bug really was in the toolchain, combined with an environment that makes
debugging a challenge.

The other such bug, in the 64-bit version of the same environment, involved
GCC compiling struct assignments into SSE instructions that assume aligned
addresses, and GRUB not actually aligning its stack for SSE because it never
actually used SSE itself and didn't happen to use struct assignments.
Debugging that one involved a quick hack of a general-protection-fault handler
that hex-dumped the bytes of code around the instruction pointer, searching
for those bytes in the compiled code, and matching that back up with the
disassembly and source code.

Most recently, I debugged a race condition in a build system, where disk image
manipulation (done by syslinux and mtools) was failing to obtain a flock()
file lock. The kernel doesn't actually have any way to find out who holds the
lock,
so I ended up instrumenting the flock syscall to print the conflicting lock
holder. Turns out that udev took a file lock on the loopback device as soon as
it showed up.
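
The failure mode can be reproduced in a few lines (Linux/Unix only; hypothetical stand-ins for the real processes): while one descriptor holds an exclusive flock, a non-blocking attempt through another open of the same file fails, just as mtools did while udev held the loop device.

```python
import fcntl, tempfile

path = tempfile.mkstemp()[1]
holder = open(path, "w")      # stands in for udev
fcntl.flock(holder, fcntl.LOCK_EX)

contender = open(path, "w")   # stands in for syslinux/mtools
try:
    fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
    got_lock = True
except BlockingIOError:       # EWOULDBLOCK: someone else holds the lock
    got_lock = False
print(got_lock)  # -> False

fcntl.flock(holder, fcntl.LOCK_UN)  # once released, the lock succeeds
fcntl.flock(contender, fcntl.LOCK_EX | fcntl.LOCK_NB)
```

Note that flock locks belong to open file descriptions, not processes, which is why even a single process can conflict with itself through two independent opens.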

------
weland
The hardest bug I ever tracked down resulted from a combination of me being a
n00b at the time and the bug being legitimately hard. It was a stack thrashing
bug
on an RTOS that ran on a system without MMU. To make things a little worse,
GCC support for that platform was still very early at the time, so GDB would
occasionally become confused, and did not support watches; besides, everything
had gotten big enough at the time that there was no way to compile the whole
system with debug symbols and no optimizations; the image was stripped and
optimized for size.

The bug wasn't easy to reproduce: all we saw was that, every once in a while,
when queried over $wirelessprotocol, the system would begin answering with
crap values (it was supposed to measure some physical quantities, and crap
values = meaningless, as in negative active power and hundreds of kV on a
mains line), and if you kept on pounding it, it would eventually start "acting
funny" -- randomly toggling LEDs and handling commands that were never given
in the first place -- before eventually crashing. The problem was very far
removed from its core; at first, all I was debugging was "system begins
answering with thrashed values after a while".

I was two days into it when a more experienced colleague (I was a junior
developer at the time) stepped in to help me. We began suspecting a process
was smashing another process' stack when, after removing module after module,
the bug was still not clearly reproducible by a particular sequence of steps,
but the behaviour it triggered became fairly uniform.

We decided a good way to test this assumption was to modify the context
switching routine to dump the current top of the stack over a serial line;
unfortunately, that introduced additional delays that prevented the bug from
occurring, so it didn't help us. We figured, however, that the handler for
$wirelessprotocol's query was in the process that smashed the other process'
stack, so we modified that handler to send the top of the stack over wireless
(this is where not having an MMU helped, ironically :-) ). The _base_ of the
other process' stack could be obtained by just tracing context switches.

Sure enough, if enough commands piled up, that process (which was running some
pretty intensive stuff, including floating point operations, on a _very_
resource-constrained system) would smash into the next one's stack, messing up
its context's registers.
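The failure mode is easy to model. Here's a toy sketch (the names, sizes, and canary value are made up, not taken from that RTOS) of adjacent fixed-size task stacks with a painted word at the base, checked the way a context-switch hook might check it:

```python
CANARY = 0xDEADBEEF
STACK_WORDS = 64  # hypothetical per-task stack size

class Task:
    def __init__(self, name):
        self.name = name
        # Paint the stack at creation; word 0 sits right against the
        # neighbouring task's stack, so it doubles as an overflow canary.
        self.stack = [CANARY] * STACK_WORDS
        self.sp = STACK_WORDS          # stack grows downward, toward index 0

    def push(self, word):
        self.sp -= 1
        self.stack[self.sp] = word     # no MMU: nothing stops sp reaching 0

def canary_intact(task):
    # A context-switch hook (or a query handler, as in the story) can
    # cheaply verify this on every switch.
    return task.stack[0] == CANARY
```

Push `STACK_WORDS - 1` words and the canary survives; one more and it doesn't -- which is exactly the "smashed into the next one's stack" condition, caught before the neighbouring task starts executing garbage.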

In retrospect, this wasn't necessarily a _difficult_ bug per se: the concept
is well-understood and the theory behind it is trivial. The biggest problem was
that it challenged the fundamental way we debug programs: when the CPU starts
doing crap, we assume we've instructed it to do crap, and that it's (correctly!)
following consistently bad instructions. In this case, the CPU ended up
following random instructions.

------
pcvarmint
Answering your question.

------
jxs41u
The Riemann hypothesis.

------
titzer
Register allocator.

------
MichaelCrawford
All three of Octel's servers would become unresponsive for no apparent reason.
Sometimes they crashed outright, but sometimes they would come back up after a
while.

Late one night when no one else was there I ran "top" only to puzzle over that
a bunch of identical command lines were consuming all the CPU:

    
    
        login -p Mkkuow....
    

I don't remember the exact username but this Mkkuow guy was trying to log into
all the terminals on each box.

I don't clearly recall how I figured this out, but it was the result of
capacitive coupling - parasitic capacitance - between the transmit and receive
RS-232 wires. The OS would transmit "SunOS login:", then read the coupled
garbage on the receive line as a username. It would prompt for the password a
few times, eventually give up, and transmit the login prompt again.

The actual username I saw is easy to figure out by graphing the voltage levels
of the transmitted ASCII, then considering how capacitance works.
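A toy model of why that reconstruction works: capacitive coupling passes edges, not levels, so the receive line roughly sees the derivative of the transmitted waveform. This sketch assumes 8N1 framing and a crude edge-only crosstalk model (pure speculation about the actual wiring; the real analysis would need scope traces):

```python
def uart_bits(text):
    """8N1 framing: idle-high line, start bit 0, 8 data bits LSB first,
    stop bit 1; one list entry per bit time."""
    bits = [1, 1]                        # a little idle time up front
    for ch in text:
        bits.append(0)                   # start bit
        bits.extend((ord(ch) >> i) & 1 for i in range(8))
        bits.append(1)                   # stop bit
    return bits

def coupled(bits):
    """Crude crosstalk model: the victim line only registers the
    aggressor's falling edges (a differentiator against an idle-high
    bias), and otherwise sits at the idle level."""
    out = [1]                            # victim idles high too
    for prev, cur in zip(bits, bits[1:]):
        out.append(0 if cur < prev else 1)
    return out

# Every falling edge of "SunOS login:" looks like a start bit to the
# victim UART, which then decodes whatever edge pattern follows -- the
# same garbage "username" every time, since the prompt never changes.
```

The key point the model captures is determinism: the same transmitted prompt always produces the same edge pattern, hence the same bogus login attempt on every terminal.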

The solution was to replace all the cables with a lower-capacitance cable.
Because that required all new connectors, as well as my time to install them,
my manager Karen Coates took some convincing, but in the end the new cable
stopped the hangs.

------
MichaelCrawford
I once found a mask error - a design flaw - in an embedded chip, but I was
unable to work around it. I had to tell my client, a prime defense
contractor, that they had selected the wrong part and would have to redesign
their boards and respin their prototypes.

------
rando289
I'm reading various "I made company/site/device x able to do y" stories, and
each seems like it would be 100x cooler if it were free for everyone to reuse
and learn from. Could y benefit science, government, medicine, or diverse
software that doesn't lock users into a platform and respects their privacy
and self-determination? If Hacker News were free software, what other
communities might have sprung up and made the world better?

------
MichaelCrawford
I was a phone hotline volunteer for the Suicide Prevention Service of Santa
Cruz County, California.

Think about that the next time your code gets you down.

