
Why HN was down - pg
Hacker News was down all last night.  The problem was not due to
the new server.  In fact the cause was embarrassingly stupid.

On a comment thread, a new user had posted some replies as siblings
instead of children.  I posted a comment explaining how HN worked.
But then I decided to just fix it for him by doing some surgery in
the repl.  Unfortunately I used the wrong id for one of the comments
and created a loop in the comment tree; I caused an item to be its
own grandchild.  After which, when anyone tried to view the thread,
the server would try to generate an infinitely long page.  The
story in question was on the frontpage, so this happened a lot.

For some reason I didn't check the comments after the surgery to
see if they were in the right place. I must have been distracted
by something.  So I didn't notice anything was wrong till a bit
later when the server seemed to be swamped.

When I tailed the logs to see what was going on, the pattern looked
a lot like what happens when HN runs short of memory and starts
GCing too much.  Whether it was that or something else, such problems
can usually be fixed by restarting HN.  So that's what I did.  But
first, since I had been writing code that day, I pushed the latest
version to the server.  As long as I was going to have to restart
HN, I might as well get a fresh version.

After I restarted HN, the problem was still there.  So I guessed
the problem must be due to something in the code I'd written that
day, and tried reverting to the previous version, and restarting the
server again.  But the problem was still there.  Then we (because
by this point I'd managed to get hold of Nick Sivo, YC's hacker in
residence) tried reverting to the version of HN that was on the old
server, and that didn't work either.  We knew that code had worked
fine, so we figured the problem must be with the new server.  So
we tried to switch back to the old server.  I don't know if Nick
succeeded, because in the middle of this I gave up and went to bed.

When I woke up this morning, Rtm had HN running on the new server.
The bad thread was still there, but it had been pushed off the
frontpage by newer stuff.  So HN as a whole wasn't dying, but there
were still signs something was amiss, e.g.  that /threads?id=pg
didn't work, because of the comment I made on the thread with the
loop in it.

Eventually Rtm noticed that the problem seemed to be related to a
certain item id.  When I looked at the item on disk I realized what
must have happened.

So I did some more surgery in the repl, this time more carefully,
and everything seems fine now.

Sorry about that.
======
DanielBMarkham
Amazing that such a large percentage of debugging involves determining exactly
_what_ you are debugging. The definition of the problem, many times, is the
solution.

Might be a good time to mention Rubber Duck Debugging.
<http://en.wikipedia.org/wiki/Rubber_duck_debugging>

~~~
z-e-r-o
For me, this should be called stackoverflow debugging. I genuinely solved a
lot of my problems by trying to write a _good_ question on SO about my
problem. The problem seems really difficult when I try to ask it in one
sentence, just out of my head. However once I try to describe the background,
what I'm trying to achieve, what I'm using, when does the problem happen,
simplified down to sub-cases, usually by the time I'd be 80% ready with
writing the question, I realize the answer.

~~~
Semaphor
That happens to me a lot. Most of the time I just formulate the question I
have in my head into something coherent and by that point I either have the
solution, know what to search for or, in case it's not a question but a
comment, I realize it's not worth saying.

------
lifeisstillgood
There are a number of comments that add up to "what steps will you take to
ensure this does not happen again" - akin to an incident review. As speculation
that's fine, as advice, I don't think it _should_ be listened to.

I am reminded of a long-in-the-tooth sysadmin of my acquaintance who logged
in everywhere as root. His theory - "they are my boxes. I screw it up, I fix
it." I eventually realised that typing sudo every time he touched a box was no
defence against doing the wrong thing.

An awful lot of sites at 1.2m views would have outsourced the running and
development of the whole thing - there are entrepreneurs who say it's not even
worth our time to code up the MVP. I find this approach sensible from a
business point of view, but still it does not sit right with me.

I am supposed to have a nice website with lots of good content to attract
inbound marketing - so I tried getting someone on textbroker to write an
article for me. It read like a High School essay - no life, no anime. And so I
will probably write my own CMS and my own content.

And pg sits there and writes his site in his own language, with his own
moderation tools. Apart from the hilarious idea he could find a ten person
ruby shop to outsource to, it's nice to see someone taking the time to play
again. It's why I like to see jgc on here too.

I am not entirely sure those thoughts are joined up (I am procrastinating like
crazy) but if they come to mean anything it's that we are playing in pg's sandbox.
If the sand leaks it's his sand, and the only company this is mission critical
to is YC.

~~~
luser001
> I eventually realised that typing sudo every time he touched a box was no
> defence against doing the wrong thing.

IIRC, sudo logs all commands to syslog. Which might come in handy. Yes, root
commands will be logged by bash to .bash_history, but there are limits on the #
of commands logged, the question of what happens if you are logged in multiple
times into the same account, etc.

Anyway, that's why I like sudo.

~~~
wpietri
Beyond the logging, which I love, I use it for the differentiation of states.
I'm just a little more attentive when I type "sudo" before something.

At work, a relatively young engineer accidentally typed a command meant for a
test database server into a production window. There was a big rush to restore
from backups, and there was a small amount of data loss.

One thing that came out of the retrospective, requested by the engineer in
question, was the _production hat_. Before you opened up a connection to a
production machine, you had to put on the large pirate hat. You could only put
it back when you had closed the connections. I didn't really need it, but it
was a great way for people to learn the necessary caution.

It also ended up being a nice exclusive lock on futzing with production, and
seeing it in use led to some good discussions that otherwise might not have
happened. But the main thing was developing a strong differentiation of states
in everybody's heads.

~~~
darkarmani
Another cheap solution is colorizing the prompt red for dangerous consoles and
green for dev machines. This makes it very easy to notice when you selected
the wrong terminal.
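
A minimal sketch of the colored-prompt idea, assuming production hosts can be recognized by a hostname prefix. The `prod-` prefix and the name `colored_prompt` are made up for illustration and are not taken from any real setup.

```python
# ANSI escape codes for terminal colors.
RED = "\033[31m"
GREEN = "\033[32m"
RESET = "\033[0m"


def colored_prompt(hostname, prod_prefixes=("prod-",)):
    """Return a prompt string: red on production hosts, green elsewhere.

    prod_prefixes is a hypothetical naming convention; adjust it to
    however your own machines are named.
    """
    is_prod = hostname.startswith(tuple(prod_prefixes))
    color = RED if is_prod else GREEN
    return f"{color}{hostname}${RESET} "
```

Calling `colored_prompt("prod-db1")` yields a red prompt and `colored_prompt("dev-box")` a green one; the returned string could be fed into whatever sets the shell prompt.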

------
dasil003
I'm not sure whether it's terrifying or relieving to realize that if all I
dream of comes to pass and I achieve something akin to the legendary status of
pg in the hacker community that I will still be susceptible to the inevitable
facepalm moments that come with direct database access.

In any case I am thankful for the detailed explanation.

~~~
larrys
Some of the most spectacular airplane crashes are by the most experienced
pilots.

If you've ever tried something new as a hobby you tend to be very careful.
Once you gain confidence you take more chances and don't do what even a
beginner might do.

~~~
stcredzero
There must be a rare personality type that never experiences this kind of
overconfidence. Perhaps a less glamorous cousin to the Buddhist beginner's
mind?

~~~
larrys
I find that if I am doing something "dangerous" (example might be using power
tools) I have to say to myself "be careful this is dangerous" to avoid being
on autopilot and making casual errors. Maybe a better example is the way you
train yourself after you've picked up a box the wrong way and pull something
to try to remember each and every time to watch your specific movements.

~~~
kamjam
But at some point, you will become complacent, and it will take a mistake to
remind yourself again.

We've all done it. I shut down an NT4 production server because I was
connected via remote desktop and clicked shutdown rather than log off. This
was back in the day when there was no pop-up asking for the reason you want to
shut down and for confirmation.

Luckily it was just our internal intranet server!

~~~
yuhong
You were running NT4 Terminal Server Edition?

~~~
kamjam
Yes, I think so. It was so long ago and it was my first programming role!

~~~
yuhong
AFAIK that edition had logoff instead of shutdown on the Start menu for that
reason. You can still access shutdown by hitting Ctrl-Alt-Del on the console
or clicking Windows NT Security.

~~~
kamjam
It was so long ago I have no idea. I defo vividly remember that I shut it down
via the start menu, just one of many moments that stick out :)

------
sehugg
Great postmortem and good lessons to learn here:

* Don't manually modify database without a well-tested procedure and another pair of eyes

* Don't leave persistent problems (e.g. memory problems) uninvestigated so that you miss new problems with similar symptoms

* Don't push new code to production while operational problem is ongoing (unless it addresses the operational problem)

I'm pretty sure I've repeated this exact same sequence before with similar
results...

~~~
zzzeek
use CHECK constraints to prevent invalid data patterns when possible.

~~~
erichocean
Sadly, even that isn't enough.

In our production database, I used CHECK constraints religiously. Worked
great.

Then one day, I was no longer able to commit ANY transactions to a particular
table, even completely innocuous ones.

The problem? The database _itself_ had violated its own CHECK constraint on a
previous commit, but was _enforcing_ it on all subsequent commits, causing
them to fail. Brilliant.

Moral: not even CHECK constraints will save you.

\----

P.s. This was a proprietary database, and when I reported the problem to the
vendor (eventually, I figured out how to reproduce it), the vendor actually
_refunded our (expensive) support contract_ rather than fix the bug -- they
couldn't figure out how to fix it despite having a small bug report that
reproduced the problem.

In the end, I actually had to remove the CHECK constraint altogether. :(

~~~
zzzeek
> The problem? The database itself had violated its own CHECK constraint on a
> previous commit, but was enforcing it on all subsequent commits, causing
> them to fail. Brilliant.

I've never heard of that, and unless you're using a really buggy, broken
database, it should not be possible.

> P.s. This was a proprietary database, and when I reported the problem to the
> vendor

well there you go. I think in practice, a simple CHECK constraint like the one
we'd do here (literally, comment_id > parent_comment_id) is pretty easy to put
one's faith into.
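
A minimal sketch of that constraint, using SQLite through Python's standard `sqlite3` module. The table layout and the helper names `make_db` and `try_reparent` are assumptions for illustration; HN itself does not store comments this way.

```python
import sqlite3


def make_db():
    """Create an in-memory comments table where a child id must exceed its parent's."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE comments (
            id INTEGER PRIMARY KEY,
            parent_id INTEGER REFERENCES comments(id),
            CHECK (parent_id IS NULL OR id > parent_id)
        )
        """
    )
    conn.executemany(
        "INSERT INTO comments (id, parent_id) VALUES (?, ?)",
        [(1, None), (2, 1), (3, 2)],
    )
    conn.commit()
    return conn


def try_reparent(conn, comment_id, new_parent_id):
    """Attempt to move a comment; return False if the CHECK constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "UPDATE comments SET parent_id = ? WHERE id = ?",
                (new_parent_id, comment_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Here `try_reparent(conn, 1, 3)` is rejected, because it would make comment 1 a descendant of its own grandchild. The trade-off is that the rule only works if ids are assigned in increasing order, and it also forbids any legitimate move of an older comment under a newer one.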

------
neurotech1
This should serve as an example template for how to accurately and
transparently explain to users what went wrong. No deflecting blame, no
useless platitudes.

Credit to PG, RTM and the rest of the team for keeping the site's uptime as
high as it is.

~~~
jgrahamc
"No deflecting blame"

Who were they going to blame?

~~~
wpietri
He could have blamed the new server. Or whatever distracted him. Or the user,
for being dumb. I've seen people do all of those.

Or he could have just dodged the blame entirely.

~~~
dguaraglia
Some people are just freaking difficult to work with. I've worked with people
that wouldn't accept responsibility even after every other possible cause was
ruled out. I've even gotten this reply: "well, you must have been unlucky to
get the faulty e-mail, because it seems to work most of the time". Yeah,
because that's how programming works: cowboy coding and hoping for the best.

This guy actually called bugfixes "optimizations": "hey, X the feedback widget
on the front page isn't working", "oh, yeah, I haven't worked on that because
that's code that needs to be 'optimized', so it's low on my issues list". Ugh.

I've learned my lesson now. In fact, that's an incredible lesson for a startup
founder: never, ever, hire someone who dodges a question on an interview. And
the first time they avoid taking responsibility for something that was clearly
their fault, fire them. The last thing you want is someone who'll blame
everyone and anything else for their issues. It's a great way to kill morale
and create rifts in a small team.

~~~
paganel
> And the first time they avoid taking responsibility for something that was
> clearly their fault, fire them.

I guess it's still difficult for a lot of people to acknowledge their own
mistakes, maybe because they're afraid of getting fired for that
(acknowledging the mistake), which in the case of startups/small companies
happens very rarely.

From my own experience of working at startups for my entire professional
career as a programmer (7 and a half years) I can tell you that the first step
when noticing you f.cked something up is to take immediate responsibility and
then asking yourself "how can I/we fix this?" (you might need the help of
other people to fix your mistake). After you've fixed the issue the question
should be "how can we make it so that this doesn't happen again?". That being
solved I'd say nobody cares anymore whose fault it was to begin with, there's
always other more important stuff to do.

I agree that maybe at larger companies this kind of thing might happen exactly
the opposite way, i.e. you can get fired for making a mistake and nobody
really cares to fix other people's stuff, because their next
paycheck/financial well-being does not depend on that (or so they think).

~~~
dguaraglia
Exactly! When everyone is in the same boat the priority is _fixing_ the stuff,
then wondering whether attributing responsibility is important (most of the
time it isn't. Who cares who fucked up the e-mail template, as long as it's
fixed next time it runs.)

I think in this guy's case the causes for his reluctance (or rather,
incapability) to accept his own mistakes had much deeper roots. He was
literally the most self-centered person I've ever met, to the point that he
wouldn't accept anyone's opinion on anything. He went out of his way to find
doctors that'd go with his suggestions and run all kinds of tests on him to
determine why he had a blood pressure problem, when he was clearly _way_
overweight and had the unhealthiest diet I've ever seen. He'd dress up in
shorts and t-shirts during the worst days of winter, and then take a niacin
pill to force a capillary rush so his hands wouldn't feel cold (?)

Basically, he just thought the world had to bend to his will. Why use common
sense, when you can just say "fuck it" and find a workaround that fits your
mindset. Of course you can't expect someone like that to 'accept' his own
shortcomings.

The scariest part of all this is he tried, for a while, to become a cop. Yep,
imagine that: a 240lbs armed prick, completely unable to reason, forcing his
way on everyone. I shudder at the thought.

~~~
lotyrin
> imagine that: a 240lbs armed prick, completely unable to reason, forcing his
> way on everyone. I shudder at the thought.

Where are you from where that's something you have to _imagine_, because I
want to move there.

------
larrys
"But then I decided to just fix it for him by doing some surgery in the repl."

I've always found it's a good idea to not deviate. Whether it be running,
parking or anything else once you deviate from some regular behavior you run
into potential problems that you hadn't anticipated.

"For some reason I didn't check the comments after the surgery to see if they
were in the right place. "

More or less my point. If this wasn't a deviation from normal behavior you
would have "checked the comments after the surgery" because it would have
either become habit or the sheer number of times you tried a fix resulting in
an error would have made that more likely to occur.

~~~
irahul
> I've always found it's a good idea to not deviate.

Aren't you assuming "surgery in repl" is a deviation? What if it's normal
course of action for him?

> More or less my point. If this wasn't a deviation from normal behavior you
> would have "checked the comments after the surgery" because it would have
> either become habit or the sheer number of times you tried a fix resulting
> in an error would have made that more likely to occur.

How about the opposite scenario? He has done it so many times with desired
results, that he didn't bother checking?

~~~
badgar
> Aren't you assuming "surgery in repl" is a deviation? What if it's normal
> course of action for him?

This is a big difference between engineering and hacking. An engineer would
never _regularly_ do something so dangerous.

But I suspect pg isn't an engineer when he works on HN, I suspect he is a
hacker, and just does whatever he wants to, whenever he wants. Which is his
prerogative.

------
tolmasky
Why do "self posts" like this show up in the same light gray as posts with
negative vote counts? My eyes aren't great and I find it hard to read

~~~
emillon
The rationale for this is that if you need to post a long text post, it should
be in the form of a blog post instead. I agree with you that it's not really
adapted for a meta post.

------
irahul
Disclaimer: Hindsight is 20/20, and stuff.

If reverting code didn't fix it, reverting server didn't fix it, incorrect
data is the most likely culprit (I am not claiming this should have outright
occurred to you; just thinking out loud). I take it you introduced
non-terminating recursion by making a thread its own parent, and you made the
change on disk.

But this analysis is the last thing that comes to mind when you already have
introduced 2 new variables the same day - new code, new server. And an old,
recurring variable (GCing too much) is in play as well.

------
benatkin
So what do you do to avoid this in the future? Do you stop doing surgery in
the repl, or do you do the surgery with functions that check for cycles from
now on?

~~~
badgar
> So what do you do to avoid this in the future?

It's HN... there's no SLA, there's no postmortems, there's no doing things
better in the future. pg just runs this site out of the good of his heart, we
should be lucky the volunteers run it for us at all.

~~~
oh_sigh
> pg just runs this site out of the good of his heart

Hilarious. I would have believed you if you appended "and his wallet"

~~~
badgar
One core. One HD. Bandwidth is trivial with no images. How much do you think
this site costs to run?

~~~
jlgreco
He means that PG runs the site because it makes business sense, not out of
charity.

------
Legion
"We'll do it live!"

~~~
jbuzbee
"Hey, Hold my beer and watch this!"

------
gruseom
This is a particularly endearing piece of "hacker news". It's so easy to
relate to.

------
lucb1e
Are you saying you manually modify the database? Like, shifting around things
by id instead of just making admin buttons next to posts?

~~~
irahul
HN runs on plain files. He wasn't modifying a database, but calling functions (I
believe) in the repl to change the parent id of the thread.

But that apart, even if there were an admin button to change the parent id of
a thread, he would still have made the same mistake.

Unless the code in question was checking for loops. In that case, repl would
have worked the same.

~~~
lucb1e
I sort of meant that you shouldn't modify things like that directly. Be it a
filesystem, database, or any other place that makes it possible to mess things
up to bring a rather strong server down.

~~~
irahul
I see where you are coming from. But I am saying this didn't happen because he
did things live. This happened because he entered an incorrect id, making a
thread its own parent (or grandparent; doesn't matter).

This is the kind of mistake one would make even when writing proper
migrations. That he was doing things live isn't the issue; neither is the
incorrect id. The issue is the code doesn't check for loops.
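
A minimal sketch of such a loop check, assuming the tree is held as a plain dict mapping each item id to its parent id (with `None` for top-level items). The names `would_create_cycle` and `set_parent` are hypothetical and say nothing about how HN's own code is organized.

```python
def would_create_cycle(parent_of, item_id, new_parent_id):
    """Return True if making new_parent_id the parent of item_id creates a loop.

    Walks up from the proposed parent; if the walk reaches item_id, the
    item would become its own ancestor.
    """
    seen = set()
    node = new_parent_id
    while node is not None:
        if node == item_id:
            return True
        if node in seen:
            # The existing data already contains a loop; refuse to go further.
            return True
        seen.add(node)
        node = parent_of.get(node)
    return False


def set_parent(parent_of, item_id, new_parent_id):
    """Reparent an item, refusing any change that would introduce a loop."""
    if would_create_cycle(parent_of, item_id, new_parent_id):
        raise ValueError(f"moving {item_id} under {new_parent_id} would create a loop")
    parent_of[item_id] = new_parent_id
```

The walk costs one step per ancestor of the new parent, so it is only paid when a comment is moved, not every time a thread is rendered. The same function works whether it is called from an admin button or typed into a repl.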

~~~
chernevik
I am but an egg, I have two questions.

One, if the data were held in a database, should a change like this be
captured in the database logs? I am seeing more and more situations where I
want these, I notice that they are by default turned off for mysql and wonder
if this reflects a de facto judgment that logging slows performance more than
is usually worthwhile.

Two, if the data were kept in a database, wouldn't something like this be
prevented by a constraint preventing a comment from making itself an ancestor?
But I suppose there is a slight performance hit in checking such constraints,
and the case arises so rarely that this hit isn't generally worthwhile.

~~~
lmm
Databases, at least the SQL kind, really aren't good at dealing with
hierarchical data, and I don't know how you'd even begin to express that kind
of constraint. I don't think a traditional database is the answer here. (If it
were me, once I'd done it more than twice I'd write a "move thread" admin tool
in the UI, and after I screwed it up like this I'd have a place to add such a
check to).

~~~
mr_luc
If you were using some kind of representation for Nested Sets -- left-to-right
depth-first numbering, or a human-readable id.id.id chain -- then it's really
easy to write a constraint for that: parent left < myleft, right > myright, or
dotted_id.split('.').filter{|first, rest| return false if rest.contains first}
(yeah, yeah, that second pseudocode would be unrealistically PITA for some
DBs).
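
Both checks can be sketched in a few lines of Python. This assumes a dotted path such as `"12.40.57"` for the id chain and `(left, right)` tuples for nested-set numbering; the function names are made up for illustration.

```python
def path_has_repeat(dotted_id):
    """Return True if any id appears twice in an id.id.id chain (i.e. a loop)."""
    ids = dotted_id.split(".")
    return len(ids) != len(set(ids))


def is_inside(parent, child):
    """Nested-set containment test on (left, right) pairs.

    A child is properly nested when its interval lies strictly inside
    the parent's interval.
    """
    parent_left, parent_right = parent
    child_left, child_right = child
    return parent_left < child_left and child_right < parent_right
```

For example, `path_has_repeat("12.40.57.12")` is true because item 12 would be its own ancestor, while `is_inside((1, 10), (2, 5))` holds for a correctly numbered child.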

More generally:

I'm not a big SQL wonk anymore, but I find a lot of people have the intuition
that relational databases are ill-suited for trees.

An intuition that is much closer to the truth is that almost all databases can
handle trees pretty well, because there's still an unambiguous concept of
ordering and containment, and you can usually arrange things so as to do
range/ancestor/inclusion queries efficiently.

It's graphs with loops/without unambiguous concept of ordering/containment
that are really hard.

~~~
EEGuy
Found this: The excellent Postgres documentation includes an SQL graph search
with two different ways of graph cycle checking, here:
<http://www.postgresql.org/docs/9.0/static/queries-with.html>

One way involves accumulating an array of nodes already visited as the tree
gets walked, checking each node as-visited for membership in the
array-to-date.

The other method, a bit more of a hack, is just adding a LIMIT clause.

I think the 'WITH' clause is a great addition to the SQL standard, very much
worth learning the weirdness of its syntax and its optional 'RECURSIVE'
term (which, as the Postgres documentation points out, isn't really recursion,
it's iteration).
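
The first method can be sketched against SQLite from Python. This is an adaptation, not the Postgres example itself: SQLite has no array type, so the visited nodes are accumulated in a comma-delimited text path instead. The table layout and the names `CYCLE_QUERY`, `thread_has_cycle` and `make_comments` are assumptions for illustration.

```python
import sqlite3

# Walk down from a root comment, carrying the path of ids visited so far.
# A row is flagged as a cycle when its id already appears in the path,
# and the walk stops expanding flagged rows.
CYCLE_QUERY = """
WITH RECURSIVE walk(id, path, is_cycle) AS (
    SELECT id, ',' || id || ',', 0
    FROM comments
    WHERE id = :root
  UNION ALL
    SELECT c.id,
           w.path || c.id || ',',
           instr(w.path, ',' || c.id || ',') > 0
    FROM comments AS c
    JOIN walk AS w ON c.parent_id = w.id
    WHERE NOT w.is_cycle
)
SELECT id, is_cycle FROM walk
"""


def thread_has_cycle(conn, root_id):
    """Return True if walking the thread under root_id revisits any comment."""
    rows = conn.execute(CYCLE_QUERY, {"root": root_id}).fetchall()
    return any(is_cycle for _id, is_cycle in rows)


def make_comments(rows):
    """Build an in-memory comments table from (id, parent_id) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, parent_id INTEGER)")
    conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)
    return conn
```

With rows `[(1, 3), (2, 1), (3, 2)]`, comment 1 is its own grandchild, and `thread_has_cycle(conn, 1)` reports it without looping forever, because flagged rows are never expanded.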

------
birken
Do you have munin monitoring on the production HN server?

That would really make situations like this easier to debug. First, it can
pinpoint exactly when something started happening, which in this situation
might have helped you realize the problem was caused by your change. Secondly,
in this specific situation it probably would have been easier to differentiate
a situation where you are running low on memory vs this completely different
situation.

As somebody who spent a lot of time professionally debugging large software
systems when they were misbehaving (as a Google SRE), I can tell you that
looking at graphs of many key metrics (disk IO, CPU, memory, then application
specific things) was always the place to start when debugging a situation,
because you can learn so many things right away. When did it start? Was it a
slow buildup or an immediate thing? What is the general problem (Memory?, Disk
IO?, CPU?, none of the above?)? Has a similar pattern happened in the past?

Then you can start to get fancy and plot things like "messages/minute" or
something and then it becomes easy to see when issues are affecting the site
performance and when they aren't.

~~~
stcredzero
That and something like the Smalltalk Change Log would have made this a
no-brainer debug. (Yes, every REPL action in Smalltalk got logged by the same
mechanism that logged every code change.) Such mechanisms aren't trivial, but
they're not rocket science either, and they have tremendous ROI.
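
A rough sketch of the idea in Python, not of Smalltalk's actual mechanism: a decorator that records every call to an admin function before running it. The names `logged`, `CHANGE_LOG`, `PARENT_OF` and `set_parent` are hypothetical, and a real log would be appended to a file rather than kept in a list.

```python
import functools
import json
import time


def logged(log, clock=time.time):
    """Decorator factory: append a JSON record of each call to `log`.

    The record is written before the wrapped function runs, so a call
    that hangs or crashes still leaves a trace.
    """
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            record = {
                "time": clock(),
                "action": fn.__name__,
                "args": args,
                "kwargs": kwargs,
            }
            # default=repr keeps non-JSON values from breaking the log.
            log.append(json.dumps(record, default=repr))
            return fn(*args, **kwargs)
        return wrapper
    return decorate


CHANGE_LOG = []      # stand-in for an append-only log file
PARENT_OF = {2: 1}   # toy comment tree: item id -> parent id


@logged(CHANGE_LOG)
def set_parent(item_id, new_parent_id):
    """Toy admin action: reparent an item."""
    PARENT_OF[item_id] = new_parent_id
```

After `set_parent(2, 5)`, the last entry in `CHANGE_LOG` names the action and its arguments along with a timestamp, which is the kind of record that would have pointed straight at the bad reparenting.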

------
znowi
I wonder what exactly did distract you :) When I do surgery on a production
server, I triple-check making sure everything works properly.

I have two assumptions: 1. HN has a low priority in the overall scheme of
things, 2. Self-confidence overflow :)

------
nowarninglabel
Happens to a lot of us. Great reason to always write tested cleanup scripts
for this stuff instead of editing directly on the server. The only time I
brought down my product last year was from a similar screwup: I was removing
users by hand and somehow managed to end up with a 0 in my list of user ids,
thus deleting the anonymous user, and causing havoc to my server, which took a
long time to track down.

------
dap
Thanks for the detailed explanation.

It sounds like everything was done to fix the problem _except_ try to figure
out what the problem actually was. Why not use tools to see what the program
is doing, form a hypothesis, gather data to confirm or reject the hypothesis,
repeat until cause found, and then take corrective action that by this point
you have high confidence will work?

I realize HN is more of a side project than a production service, but the goal
is the same in both cases: to restore service quickly so you can move on to
other things. It feels like a more rigorous approach would allow restoring
service much faster than randomly guessing about what could be wrong and
applying (costly) corrective action to see if it helps.

Besides that, in many cases (including this one), you cannot randomly guess
the appropriate corrective action without finding the root cause.

------
luser001
I use assertions to protect against things like this.

I liberally sprinkle my code with assertions (CS theory calls them
pre-conditions and post-conditions, iirc) to crash early if the system is in an
invalid state.

One of my pet peeves is that few programmers seem to love assertions like I do.
Would love to see comments on this.

~~~
timothya
What assertion would you have used in this case? For every comment you'd have
to iterate through all its parents to check if there is a cycle, which seems
pretty inefficient to do for something that should never happen (there are
other ways that you could check for this problem as you go, but the only other
ways that I can think of require holding extra state just in order to perform
the assertion).

I'm for assertions when they are simple and don't cost much (especially during
development), but it's not feasible to check every condition that should not
happen.

~~~
petercooper
You could assert a limit on depth, perhaps. Then the cycle would still exist
but after X number of comments, the rendering ends.
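
A minimal sketch of a depth-limited renderer, assuming the thread is available as a dict from item id to a list of child ids. The name `render_thread`, the `max_depth` default of 50 and the placeholder text are all made up for illustration.

```python
def render_thread(children_of, root_id, max_depth=50):
    """Render a thread as indented lines, cutting off below max_depth.

    A loop in the data still exists, but rendering is guaranteed to end.
    """
    lines = []

    def visit(item_id, depth):
        if depth > max_depth:
            lines.append("  " * depth + "[thread truncated]")
            return
        lines.append("  " * depth + f"item {item_id}")
        for child_id in children_of.get(item_id, []):
            visit(child_id, depth + 1)

    visit(root_id, 0)
    return lines
```

With looped data such as `{1: [2], 2: [3], 3: [1]}` and `max_depth=5`, the renderer emits six item lines and then a truncation marker instead of an infinitely long page. A very deep but legitimate thread would be cut off the same way, which is the cost of this approach.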

~~~
timothya
This is a reasonable solution. While it will (almost) never provide the
correct result (it might print out a cycle of comments until X is reached, or
it might cut off a very long but legitimate comment thread), it would provide
a reasonable guarantee on this sort of problem not generating infinite pages.

~~~
petercooper
At the risk of being accused of flame-baiting, I'd say it's the engineering
solution rather than the mathematical one.. ;-)

For some reason I tend to be a fan of the "stick it in a secure box" rather
than "get it right in the first place" approach..

------
d0m
Hacking code in the repl without testing the new behavior. We all did that.
Don't lie. Once I wanted to quick fix a "gmail.ca" to "gmail.com", which I
did.. but to all the users instead of just the one mistaken. Fortunately I
realized my mistake really fast ;-)

------
IgorPartola
The pink sombrero could have saved HN:
<http://www.bnj.com/cowboy-coding-pink-sombrero/>

------
Uchikoma
Appreciating the details.

"Hacker News was down all last night."

With the internet there is no "last night" ;-) Europe - and more so Asia I
assume - had to live for many working hours without HN.

~~~
pramodliv1
Yeah, I was more productive yesterday. But I did read google cached versions
of HN.

------
scotthtaylor
PG, quick question: Did this impact the server hosting the YC Summer 2013
applications?

When I tried to edit mine, it simply said "Thanks, scotthtaylor"

~~~
scotthtaylor
Working again now.

------
fnordfnordfnord
>>I caused an item to be its own grandchild.

Please forgive me. I know you folks tend to hate jokes on here. Don't waste
your time if you're immune to corny humor. "I'm My Own Grandpa- Ray Stevens"
(with family tree diagram) <http://www.youtube.com/watch?v=eYlJH81dSiw>

------
robomartin
Great story! Yup, this kind of thing happens. For some reason it reminded me
of something that happened to me as a newbie engineer. It was really funny a
week later.

I was troubleshooting an intermittent problem in a piece of equipment. It had
several boards full of mostly LS TTL logic chips (yes, them chips). It was the
kind of problem that only happened once every other day or two. Nobody knew.
So, I had all kinds of instruments attached to this thing and was watching it
like a hawk waiting for a failure. It had probes attached to every point in
the circuit where I suspected I could see something and learn about the source
of the problem. I also tested for thermal issues with heat guns and freeze
sprays, familiar troubleshooting techniques to anyone who's done this kind of
thing.

Anyhow, every so often the thing would go nuts. The three scopes I had
connected to it showed things I simply didn't understand. I'd analyze but
couldn't make any sense out of it. Still, again, every so many days it would
happen again. Changed power supplies and the usual suspects. No difference.

Well, finally, two weeks later, the other engineers in the office took pity on
me and told me what was going on: They had connected a VARIAC to the power
strip I was using to power the UUT (unit under test). The scopes and other
test instruments remained on clean power. Every so often they'd reach into
this drawer where the VARIAC was hidden and lower my power strip's voltage
just enough for the power supplies to fall out of regulation and everything
start beeping and sputtering. Those friggin SOB's. They had me going for days!
I was pissed beyond recognition. Of course, after a while I was laughing my
ass off alongside them. Good joke. Cruel, but good.

My revenge: A CO2 fire extinguisher rigged to go off into his crotch when my
buddy sat down to work.

Fun place to work. We did this kind of stuff all the time. Today I'd be afraid
of getting sued. People have really thin skins these days.

------
DanI-S
n.b. that this is why time travel is a _terrible_ idea.

~~~
cmaggard
The grandfather paradox is the best solution to the halting problem.

------
louischatriot
Funny to see that this happens to everyone. A week ago, while testing some
stuff to locate a low-importance bug, I erased the whole user database.
Fortunately we have a good restore so the problem was solved in a few minutes,
but still, cold sweat here ...

------
neilxdsouza
Isn't it curious that the comet incident over Russia happened so close to the
pass of DA 14. In the intro to the book:

<http://ruby.bastardsbook.com/about/#why>

is the note about surgical instruments left inside. It seems just like a
coincidence that this happened so close to the switch to the new server, but I
wonder if it's something deeper in the subconscious mind; the change to the
new server is quite a big change (I know I feel that way when I have purchased
a new computer (it feels different - even if it's running the same linux as
before)) and could have upset the normal checks one has in place when tweaking
things.

------
cool-RR
Great debugging story!

I guess the lesson is to have code that alerts you about comment loops without
going into an infinite loop.

Also another lesson would be to figure out a way to have better clarity into
which requests are causing a timeout on the server.

------
raheemm
I'm curious how RTM was able to notice that the problem seemed related to a
specific item id. It would be great if he would write a short blurb similar to
yours. Which also makes me wonder, why does RTM not write much?

------
Posibyte
I absolutely love post-mortems like this. It clearly identified that there was
a problem, what the author tried to do to fix it, and if it was successful.
Even if it ends with the author not knowing too much about the solution that
was used, it's still so interesting to see the workflows and be able to derive
something from it.

It's also why I like to read pg's articles so much. They're so in-depth and
detailed, and you're never left feeling that something was left out for the
sake of being hidden.

------
rnadna
I fall into a similar misdirected-focus trap, but mine is simpler: I waste an
embarrassing amount of time in editing the wrong damned file. After a sequence
of small tweaks that yield no change in the results, I make a huge change and
see nothing, and then realize that I've done it yet again. I need to write a
vim macro that blanks the screen every few minutes, displaying the message
"are you SURE this is the right file?"

------
mikedmiked
> created a loop in the comment tree; I caused an item to be its own
> grandchild.

Ah, the online forum equivalent of going back in time to kill your
grandfather.

------
RKoutnik
It's nice to know that even the mightiest of us can still make mistakes.
Thanks for being willing to admit mistakes so the rest of us can learn.

------
johnobrien1010
Thanks for fixing it.

Have you considered avoiding dipping into the repl to do these kinds of fixes?
You don't owe any of us any sort of uptime guarantee, and you're a much better
programmer than I, but it strikes me as odd that you would hack against the
live server instead of creating some tool that would make it so you couldn't
take down the whole site when making this kind of fix...

------
sideproject
Thank goodness it's back. I lost my meaning of existence for the entire
day. I don't know where my yesterday went. I'm ok now. :)

------
corwinstephen
It's never what you think it is. One time, I had a memory leak in a Rails app
that took me TWO WEEKS to find. In the end, it came down to me putting a line
of config code in the wrong section of the config file, which for some reason
created a recursive loop and caused my servers to crash about once every 30
minutes. #weak

------
xentronium
Whoa, what an unfortunate coincidence. This whole bug would be so much easier
to find, if it weren't for the new server.

~~~
lmm
The bugs that actually hit production are always like this - a confluence of
three or so factors - because if it were simpler you'd have caught it earlier.

(Though I have to say, upgrading the code at the same time as you're
restarting to fix a problem is really a rookie mistake. It's incredibly
tempting because it saves so much time, but if you do it you _will_ get it
wrong sooner or later. One of the hardest skills in programming is acquiring
that zen that you need to wait in a state of readiness for the effects of your
first change to make themselves apparent, rather than changing something else)

------
GnarfGnarf
That's funny -- I work in genealogy software, and loops ("being your own
grandpa") happen all the time, due to data entry errors. To avoid infinite
recursion, we always keep track of what records we've processed already, check
whether "I've been there before", and bail out if the answer is affirmative.
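
A minimal sketch of that "been there before" check, applied to a comment tree rather than a family tree. The `children` mapping (item id to reply ids) and the function name are assumptions made for illustration; this is not how HN or any genealogy package actually stores its data.

```python
def render_thread(root_id, children):
    """Return item ids in display order, visiting each item at most once.

    `children` is a hypothetical mapping from an item id to its reply ids.
    """
    seen = set()
    order = []
    stack = [root_id]
    while stack:
        item = stack.pop()
        if item in seen:
            # "I've been there before": bail out instead of looping forever.
            continue
        seen.add(item)
        order.append(item)
        # Push replies in reverse so they come off the stack in original order.
        stack.extend(reversed(children.get(item, [])))
    return order


# Item 3 lists item 1 as a reply, so item 1 is its own great-grandchild.
looped = {1: [2], 2: [3], 3: [1]}
print(render_thread(1, looped))
```

With the visited set in place the traversal terminates even on corrupted data; without it, the same loop would keep pushing item 1 back onto the stack.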

------
cranklin
You are honest and I respect that. I'm sure many companies try to play off
their downtime as something far more sophisticated when in fact, it was
something too embarrassing to admit. I've certainly had my fair share of
embarrassingly stupid mistakes that resulted in downtime.

------
richforrester
Cheers for that pg - now I have to explain to my boss why I was _actually_
productive yesterday.

------
aaronh
My pet peeve: You made an arbitrary change while debugging a problem. NOW YOU
HAVE N^2 PROBLEMS!

------
rjempson
That is why some organizations don't allow ad hoc data fixes to be run in
production. Best practice is to back up the database, run the fix against the
backup, test the fix against the backup, and all being well run the fix
against production.

------
T-zex
Thank you for the honest explanation. This is not so easy especially for a
famous person.

------
bramcohen
You should probably make your code robust to this sort of data corruption in
the future.

------
infoseckid
"I don't know if Nick succeeded, because in the middle of this I gave up and
went to bed." - Not a good example to your holding companies :) What would
happen if they all went to bed when something goes wrong :) Just kidding.

------
harrisreynolds
Just about anyone that has programmed for any length of time has done
something like this. It is one of those "fixes" that after it's actually fixed
you try to never think of it again. Good to know PG is mortal. :-)

------
ricardobeat
Related question: what is the timing for the 'Reply' link to show up? I might
be fantasizing but sometimes it takes 5, sometimes 10 minutes to appear,
leading people to reply as a sibling instead.

~~~
DanBC
Deeply nested comments tend to be hot, and so the reply link takes a while to
show up to try to give people some time to think about what they're going to
say.

~~~
ricardobeat
Interesting. So it slows down discussion as it progresses, until it either
stops, is forgotten or becomes a series of long essays.

------
carpathios
Seems related:
<http://meta.stackoverflow.com/questions/66377/what-is-the-xy-problem>

------
ibudiallo
When Hacker News was down, I finally lifted my head and realized that there is
life beyond the screen on my phone.

Now that it's back, I've realized that it's finally time to create an account :)

------
nournia
It seems that in your new server and also latest pushed code, I can't do
`like` anything. Honestly it's not a new bug and I got used to that, don't
think about that.

------
btilly
That explains something weird I saw.

If I went to Google's cached copy, I could see threads, and then click on
them. But the front page was down. But I could see individual threads.

Very confusing.

------
dennisgorelik
Did you add code that detects very deep nesting levels (e.g. depth more than
100) and throws meaningful exception to help developers to diagnose the
problem?
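
Roughly what such a guard could look like, as a sketch only. The limit of 100 comes from the question above; the `children` mapping, the `ThreadTooDeep` exception and the `walk` function are hypothetical names, not anything from the HN source.

```python
MAX_DEPTH = 100  # threshold suggested in the comment above


class ThreadTooDeep(Exception):
    """Raised when a thread nests deeper than MAX_DEPTH levels."""


def walk(item_id, children, depth=0, path=()):
    """Yield (item_id, depth) pairs for a thread, failing loudly past MAX_DEPTH.

    `children` is a hypothetical mapping from an item id to its reply ids.
    """
    if depth > MAX_DEPTH:
        # Name the item and its nearest ancestors so the log points at the
        # offending thread instead of just showing a swamped server.
        raise ThreadTooDeep(
            f"item {item_id} is nested deeper than {MAX_DEPTH} levels; "
            f"nearest ancestors: {list(path[-5:])}"
        )
    yield item_id, depth
    for child in children.get(item_id, []):
        yield from walk(child, children, depth + 1, path + (item_id,))
```

A depth limit does not prove there is a loop, only that something is suspicious, but it turns an endless page into an error message that names an item id.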

------
hnriot
It's a good job it's your site; this type of thing is often what gets someone
fired in a company. Modifying (meddling!) the production system directly.

~~~
JohnBooty
I disagree. Very few companies would think negatively of an engineer if they
made such a mistake on a non-essential, non-revenue-generating fun/research
project.

How many dollars did YC lose because of the outage? None. (Maybe they saved a
few on bandwidth!)

I also predict that exactly zero startups will say, "Man... I'm not going to
take seed money from _those_ guys! They had discussion forum downtime."

~~~
hnriot
They could save even more if they shut it down! That's a ridiculous thing to
say. We could all save money that way.

You've obviously never worked in a for profit corporation. In such companies
there are policies and practices put in place to prevent just this kind of
newbie mistake. You never modify the live database directly. Never ever.
Whether it's a bottom line property or not.

I didn't say it would negatively impact YC's business. It might make them look
incompetent, but these things happen, but people don't approach YC for their
website savvy, they go there for the money and the connections. Most of the VC
firms I've EIR'd at have much worse IT than hn. Their sites are barely usable.
It seems to just go with the territory.

Let's not be so defensive, PG can do what he likes with his site, including
take it down whenever he feels like saving bandwidth. But in the real world
these kinds of things get real people on a fast track to their exit interview.

~~~
JohnBooty
"You've obviously never worked in a for profit corporation"

Wow, really? I don't think that attitude is warranted at all.

At any company (for-profit or otherwise) there is a finite amount of time and
money -- and surely we can agree that solid development/deployment practices
carry an upfront time/money cost, can't we?

In an ideal world, all projects would have continuous build processes,
automated tests, and management tools extensive enough to render live database
surgery unnecessary.

Perhaps you've worked at companies so flush with cash that every single line
of code, research project or otherwise, has gone through rigorous
development/testing/deployment practices. If so, I'm jealous. I've always
worked at companies that had to be choosy about how they spend their
resources.

------
patrickwiseman
Don't worry I just figured out the totally bone-headed programming mistake I
made at noon today. Time is a good mediator between skill and stress.

------
sgt
Much appreciated, pg. I knew that the "10 minutes of downtime" would not occur
(fair enough, this was not related to the server upgrade).

------
calinet6
Ok, I'll just say it: that's just plain dumb. It's a rare case, but a simple
condition would have checked and prevented this. :)

------
mempko
Did you hear about the tortoise and the hare?

------
DrJosiah
Everyone fat-fingers a database at some point... Then you build interfaces so
that you can't make the same mistake.
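
As a sketch of what such an interface might check for the comment surgery in question: before moving an item, walk up from the proposed parent and refuse the move if the item itself turns up. The `parent` mapping (item id to parent id, with top-level items absent) and the `reparent` function are assumptions for illustration.

```python
def reparent(item_id, new_parent_id, parent):
    """Move `item_id` under `new_parent_id`, refusing moves that create a loop.

    `parent` is a hypothetical mapping from an item id to its parent's id;
    top-level items have no entry.
    """
    seen = set()
    ancestor = new_parent_id
    # Walk up from the proposed parent; `seen` guards against data that is
    # already corrupted so the check itself cannot loop forever.
    while ancestor is not None and ancestor not in seen:
        if ancestor == item_id:
            raise ValueError(
                f"refusing to move item {item_id} under {new_parent_id}: "
                f"it would become its own ancestor"
            )
        seen.add(ancestor)
        ancestor = parent.get(ancestor)
    parent[item_id] = new_parent_id


parent = {2: 1, 3: 2}   # 1 is top-level, 2 replies to 1, 3 replies to 2
reparent(3, 1, parent)  # fine: 3 becomes a sibling of 2
```

Calling `reparent(1, 3, parent)` on the same data raises `ValueError` and leaves the mapping untouched, which is the mistake a wrong id in the repl would otherwise let through.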

------
Jplenbrook
Why does PG maintain the website himself? I would think he would have many
better things to do with his time.

------
pilas2000
That's funny because one of the top posts in progit yesterday was about the
Hare and Tortoise algorithm
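
For reference, a small sketch of that algorithm (Floyd's cycle detection) applied to parent links, where every item has at most one parent. The `parent` mapping and the function name are assumptions for illustration.

```python
def has_ancestor_loop(item_id, parent):
    """Return True if following parent links from `item_id` never terminates.

    `parent` is a hypothetical mapping from an item id to its parent's id;
    top-level items have no entry.
    """
    slow = fast = item_id
    while True:
        # The hare takes two steps up the tree for every one the tortoise takes.
        for _ in range(2):
            fast = parent.get(fast)
            if fast is None:
                return False  # reached a top-level item, so no loop
        slow = parent.get(slow)
        if slow == fast:
            return True  # the hare lapped the tortoise inside a loop
```

Unlike a visited set, this needs only constant extra memory, at the cost of walking the chain a few more times.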

------
dylangs1030
Thanks for the explanation pg. As you said in the original thread, "you know
how these things go..."

------
campnic
The nice thing about surgery with a computer program on a server is that death
is not permanent.

------
orangethirty
It makes me feel good knowing better programmers than me go through the same
issues I face. :)

------
thedaveoflife
I think this demonstrates how many people browse the /threads?id=pg page
(myself included)

------
meshko
TIL there are still large web sites out there that do not have staging
environment.

------
andreasklinger
I appreciate (if not love) the fact that you bugfix and server-change
yourself.

True hacker spirit.

------
blantonl
_Sorry about that._

No worries.

So, are we back on the new server? Or was this too much for one transition :)

------
scotthtaylor
Normality has returned :-)

------
Nux
I was almost sure it was Anonymous! ... Are you in Anonymous, pg? :D

------
afshinmeh
Same problem in Iran, I couldn't access HN all day yesterday.

------
wpeterson
I guess it's time for NewRelic to add an Arc agent.

------
w_t_payne
That sort of thing is fine for a startup in its first year or two of life,
but HN has been around for a while now ... surely you must have some sort of
process by now?

------
bestest
So, uh, still fixing stuff in production?

------
cincinnatus
The cobbler's children have no shoes :-)

------
dahumpty
pg,

Just wondering as to why HN isn't hosted in the cloud? (e.g. on AWS, Rackspace
etc.). How do you back up all the data?

~~~
wtracy
Because that would cost more?

I don't really know what the benefit of cloud hosting would be in this case.

------
nigo
I appreciate pg's frankness here.

------
keikun17
I hope that user wasn't me. I was editing a typo in a comment right when it
happened.

------
DocG
I think we have a new king!

Awesome explanation.

------
eluos
"I am my own grandpa"

------
arundelo
Even Homer nods!

------
youngerdryas
> On a comment thread, a new user had posted some replies as siblings instead
> of children. I posted a comment explaining how HN worked. But then I decided
> to just fix it for him by doing some surgery in the repl.

No good deed goes unpunished!

People sometimes reply as siblings because they're too impatient to wait for
the built-in delay on child comments.

Thanks for keeping the experiment going.

------
naturalethic
If the problem existed before the code update, why would you assume it was the
code update that caused the problem?

------
bobsoap
After breaking many things myself due to similar, seemingly minuscule edits, I
have implemented an ABC routine: Always Be Checking. Even if it was "just"
something like moving a piece of code or something equally tiny, I always
check after the fix.

So far, it has been working great.

------
jack57
Are you sure that comment's name wasn't Phillip J Fry?

~~~
jack57
Apologies for the trivial comment

