Why HN was down
1049 points by pg 1591 days ago | 287 comments
Hacker News was down all last night. The problem was not due to the new server. In fact the cause was embarrassingly stupid.

On a comment thread, a new user had posted some replies as siblings instead of children. I posted a comment explaining how HN worked. But then I decided to just fix it for him by doing some surgery in the repl. Unfortunately I used the wrong id for one of the comments and created a loop in the comment tree; I caused an item to be its own grandchild. After which, when anyone tried to view the thread, the server would try to generate an infinitely long page. The story in question was on the frontpage, so this happened a lot.
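
To make the failure concrete, here is a minimal sketch of the kind of check that would refuse such an edit. It is not HN's actual code (HN is not written in Python); the `parents` dict and both function names are assumptions made up for illustration.

```python
# Hypothetical in-memory comment store: item id -> parent id (None for the story).
# This layout is an assumption for illustration, not HN's actual data model.
parents = {1: None, 2: 1, 3: 2, 4: 1}


def would_create_cycle(parents, item, new_parent):
    """Return True if making new_parent the parent of item would create a loop."""
    node = new_parent
    seen = set()
    while node is not None:
        if node == item:
            return True  # item is already an ancestor of new_parent
        if node in seen:
            return True  # the existing data already contains a loop
        seen.add(node)
        node = parents.get(node)
    return False


def reparent(parents, item, new_parent):
    """Move item under new_parent, refusing edits that would create a loop."""
    if would_create_cycle(parents, item, new_parent):
        raise ValueError(f"refusing to make {new_parent} the parent of {item}")
    parents[item] = new_parent
```

Here item 3 is a child of item 2, so `reparent(parents, 2, 3)` would make item 2 its own grandchild; the sketch raises `ValueError` instead of writing the loop.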

For some reason I didn't check the comments after the surgery to see if they were in the right place. I must have been distracted by something. So I didn't notice anything was wrong till a bit later when the server seemed to be swamped.

When I tailed the logs to see what was going on, the pattern looked a lot like what happens when HN runs short of memory and starts GCing too much. Whether it was that or something else, such problems can usually be fixed by restarting HN. So that's what I did. But first, since I had been writing code that day, I pushed the latest version to the server. As long as I was going to have to restart HN, I might as well get a fresh version.

After I restarted HN, the problem was still there. So I guessed the problem must be due to something in the code I'd written that day, and tried reverting to the previous version, and restarting the server again. But the problem was still there. Then we (because by this point I'd managed to get hold of Nick Sivo, YC's hacker in residence) tried reverting to the version of HN that was on the old server, and that didn't work either. We knew that code had worked fine, so we figured the problem must be with the new server. So we tried to switch back to the old server. I don't know if Nick succeeded, because in the middle of this I gave up and went to bed.

When I woke up this morning, Rtm had HN running on the new server. The bad thread was still there, but it had been pushed off the frontpage by newer stuff. So HN as a whole wasn't dying, but there were still signs something was amiss, e.g. that /threads?id=pg didn't work, because of the comment I made on the thread with the loop in it.

Eventually Rtm noticed that the problem seemed to be related to a certain item id. When I looked at the item on disk I realized what must have happened.

So I did some more surgery in the repl, this time more carefully, and everything seems fine now.

Sorry about that.




Amazing that such a large percentage of debugging involves determining exactly what you are debugging. The definition of the problem, many times, is the solution.

Might be a good time to mention Rubber Duck Debugging. http://en.wikipedia.org/wiki/Rubber_duck_debugging


A few times a month, I'll look up at one of my colleagues and say, "hey, got a sec? I need to talk to the duck," and they know this means I'm going to talk to their head but they can basically keep doing what they're doing and nod occasionally.

This serves several purposes:

(1) It's less insane-sounding than actually talking to an inanimate object in an open work environment.

(2) It actually feels better and forces me to think more clearly when I'm talking to an actual person -- the cognitive focus is higher when the object of conversation can actually, in theory, think and talk back (YMMV).

(3) And finally, although it does require some focus on the part of the other coder, it's not nearly as taxing to them as actually helping me solve the problem or pairing up with me.

So it's a good compromise somewhere between pair programming and talking to an actual rubber duck. Again, YMMV. Maybe I'll call it "Pair Ducking."


I call it the House method :)

You bring a detailed problem and break it down, and talk about it to someone else (who often isn't qualified to answer your questions due to knowledge/time constraints) - and in doing so - resolve the problem by challenging one's own assumptions.

This was effectively how every House episode was resolved.


When I'm stuck on a problem for way too long, I start typing it out in Stack Overflow. Usually by the time I'm done describing it, I've already solved it.


I've lost count of the number of times I've done that. Also, I'm probably the top of the pops in answer replies to my own questions.


I also feel guilty when I do this, but at least the answer helps others who might have the same question.


I think stackoverflow encourages[1] this, so no need to feel guilty.

1. http://blog.stackoverflow.com/2011/07/its-ok-to-ask-and-answ...


Haha, we called it the House method too, we ended up making a cardboard humanoid for when nobody was available.


This is so true, probably about 90% of the time my colleagues call me over about a problem they are facing, explain in detail what the problem is and then eureka! Most of the time it would actually take me much longer to figure out the exact issue since I don't know the ins/outs and subtleties of the code but it's exactly as you say.

Of course, it makes me look really good cos I just "helped" them solve their issue :)


Maybe because the solution is in asking the good questions.


Yup, for sure it is. Sometimes just having someone else there and having to run through all the steps for them points out the obvious. There was another user further down who said he solved a lot of his own problems just by typing them out with enough detail to be able to post on StackOverflow. Same principle.


On occasion, I'll write a question on stackoverflow and re-read it a few times before hitting submit just in case I get that eureka moment. I think I've written way more non-submitted questions than submitted questions.


You should consider submitting the question and answering it yourself, might help someone else.


I do this a lot too. In fact more often than not I end up not posting the question because either I solve the problem or I think of a possible solution I should go try first. I think we should start calling it Digital Rubber Ducking.


I am looking for an excuse to make digitalrubberduck(y/ie).com

If I was a bit more clever I feel like there is a use there.


Just make a programming-themed chatterbot, drop in some ads (if you are so inclined), and you are golden! Someone posted a plugin for IntelliJ that allows you to do just that. I would use a web-based one if it existed.


I have occasionally used eliza for this (the basic chatbot in emacs and elsewhere). I'm sure with a bit of tuning you could make a debugging-centred variant of it.


What do the Unit Tests say?

Have you tried running a debugger and stepping through the code?

Hmm, go on.

Wait a sec, can you reexplain that last bit?


I have a whiteboard in a closet I use for this... seriously...I go in there and start jotting down notes and talking to myself.

It fascinates me how quickly I usually find the answer.


Amusingly, IntelliJ has a plugin to do exactly this:

https://sites.google.com/site/codeconsultantplugin


That's fantastic. I use IntelliJ for Android stuff, installing now. They should add a 'Duck' mode that just brings up a big picture of the rubber duck.


Working from home I tend to just write out my thoughts on a piece of paper. It works perfectly.


When I work from home (and I did it exclusively for 8 months of last year) I ended up talking to my wife (my 1 yo daughter wouldn't stand still for long enough).

My wife is a social worker by training, so it was pretty rare (though not unheard of) for her to be able to give me real input, but over the years I've trained her well enough to follow most of what I'm saying and nod at the right points :)


I send myself an email for the same purpose and set up an alias for /dev/null. It makes keeping notes a lot easier since I have the copy in sent mail and can reply to it as needed in the future. (this approach seemed less crazy before I typed it out...)


For me, this should be called stackoverflow debugging. I genuinely solved a lot of my problems by trying to write a _good_ question on SO about my problem. The problem seems really difficult when I try to ask it in one sentence, just out of my head. However once I try to describe the background, what I'm trying to achieve, what I'm using, when does the problem happen, simplified down to sub-cases, usually by the time I'd be 80% ready with writing the question, I realize the answer.


That happens to me a lot. Most of the time I just formulate the question I have in my head into something coherent and by that point I either have the solution, know what to search for or, in case it's not a question but a comment, I realize it's not worth saying.


I'm a serial SO self-answerer. I write really in depth, complicated questions for complicated problems, with code, data and testable cases - and by the time I've finished the question and posted it - I've figured out the solution - or I'll have it a few hours later.

I usually just leave the question/answer online so that others can benefit from it.


I've made several posts to SO and then realized the answer moments later. I usually just self-answer.


Same here: I've been working on a couple of projects by myself for the most part of last year, and when even the duck failed, I could usually figure out an answer just by trying to find the words to post my problem in SO in a way somebody would take the time to read it and be able to answer it. I don't recommend it as a first approach, though, since it's quite time consuming (Or maybe I should blame it on not being a native speaker...)


Likewise for me, but with IRC. Though I suppose I should try asking on SO first to save myself the semi-public embarrassment ;)


Yup, the incentive is there to state your problem as clearly as possible to get back a good response. By doing this I answer my own question half of the time.


There is a line from Futurama that perfectly applies to a lot of debugging.

Farnsworth: My God, is it really possible?

Fry: It must be possible, it's happening.

Fry: By the way, what's happening?


Extremely appropriate as Fry is his own grandfather and the site software can't handle that relationship.


That's one of my favorite lines from Futurama, 'Ohh, a lesson in not changing history from Mr. "I'm My Own Grandfather"!'


Curiously, that episode was on TV where I live just an hour ago.



Oh, I know. Just wondered if the person who posted the comment above mine had just seen that particular episode, too.


Is this the forward time machine episode??

I love futurama more than any man could love any tv show.


I feel like your name reflects that fact. It seems to be a reference to the Banach-Tarski duplashrinker


More likely it is a reference to its eponym, the Banach-Tarski paradox.


It's both :) I studied math too.


It is -- it's when they are observing the second big bang.


Amazing? For anyone who has read Polya's "How to Solve It" (http://en.wikipedia.org/wiki/How_to_Solve_It), that is hardly surprising.

If you don't understand your problem, you can't make a plan. If you can't make a plan, you can't execute it.

Another interesting lesson from that book is that one should spend time on evaluation (how did this come about? Could we have fixed this sooner? How are we going to prevent it in the future?).


People at work are amazed when I successfully debug an issue over the phone. In reality, it amounts to 50% experience plus another 50% of Sherlock Holmes: "When you have eliminated the impossible, whatever remains, however improbable, must be the truth". Once you've identified what you're dealing with via a few strategic questions, it becomes simple quite rapidly.


One of my favourite debugging tips has always been "give everyone full access to the folder/service" and see if the problem is "fixed". If so, revert and then apply the correct permissions. I've seen this come up so many times, although my superiors always complained "it's not the right way to do it". Whilst I agree "everyone full access" is bad, this was for debugging purposes only!


Debugging is often best accomplished as a binary tree search aimed by familiarity/experience. Once you can put bounds on the search, it becomes possible to get the answer in just a few questions.
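
A minimal sketch of that search, assuming the suspects can be put in order (commits, config changes, steps in a pipeline) and that a hypothetical `is_broken(index)` check can be run against each one. Both names are made up for illustration.

```python
def first_bad(changes, is_broken):
    """Return the index of the first change for which is_broken(index) is True.

    Assumes the changes are ordered and that once something breaks it stays
    broken, so the checks look like [False, ..., False, True, ..., True]
    with at least one True.
    """
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_broken(mid):
            hi = mid      # the culprit is at mid or earlier
        else:
            lo = mid + 1  # the culprit is after mid
    return lo
```

With 100 candidate changes this needs at most 7 checks, which is the "just a few questions" effect. The bounds matter: if the assumption that something in the list is actually broken is false, the answer is meaningless.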

Totally agree.


This is the best way to work through debugging/troubleshooting as far as I can tell. Amazingly, it's a skill many people lack, while others just understand it intuitively without ever thinking about it. That is one of the big divisions between hackers and everyone else in my mind.


I think perhaps some people just don't see the world in a hierarchical way, so in their frame of mind, the problem is intractable.


It's amazing how often I am able to fix problems by simply trying all the possible solutions--often while colleagues are saying things like "stop wasting time, it can't possibly be that." But of course often it is "that".


This sort of debugging seems to have been around since the very beginning: http://blog.jgc.org/2010/05/talking-to-porgy.html


I'm not sure if rubber duck debugging would have helped here. The problem was in the data, not the code. (I know, I know: in Lisp code is data.)


That's exactly the sort of thing a duck will tell you. "So what has changed? Let's see, new code, new server, and I fixed the commenter's comment. That was simple I just hard-hacked the comment id and . . . excuse me a second."


Good point. I was thinking in terms of going through the code line by line, which if anything would lead you away from the trail.


Yep. I thought this through as I was typing my comment.

(There must be some joke involving the use of a meta-duck, but I can't come up with it. :) (Same principle applies, of course, just LISP makes the determining of "what" a bit more tricky. (insert discussion here about the general differences between debugging imperative and functional code)))


Rubber duck debugging may have actually been the distraction that caused pg to make the mistake, too!


This. Even with the best test coverage in the world, you still bump into edge cases that you couldn't have predicted. As a former QA Engineer, I used to say there's still room for QA in a test driven environment. Now I say there's no replacement for a sharp mind with enough knowledge, curiosity, and good judgement.


This is also why pair programming is so great.


Not easy on a Sunday while at home.


There are a number of comments that add up to "what steps will you take to ensure this does not happen again" - akin to an incident review. As speculation that's fine; as advice, I don't think it should be listened to.

I am reminded of a long-in-the-tooth sysadmin of my acquaintance who logged in everywhere as root. His theory - "they are my boxes. I screw it up, I fix it." I eventually realised that typing sudo every time he touched a box was no defence against doing the wrong thing.

An awful lot of sites at 1.2m views would have outsourced the running and development of the whole thing - there are entrepreneurs who say it's not even worth our time to code up the MVP. I find this approach sensible from a business point of view, but still it does not sit right with me.

I am supposed to have a nice website with lots of good content to attract inbound marketing - so I tried getting someone on textbroker to write an article for me. It read like a High School essay - no life, no anime. And so I will probably write my own CMS and my own content.

And pg sits there and writes his site in his own language, with his own moderation tools. Apart from the hilarious idea he could find a ten person ruby shop to outsource to, it's nice to see someone taking the time to play again. It's why I like to see jgc on here too.

I am not entirely sure those thoughts are joined up (I am procrasting like crazy) but if they come to mean anything it's that we are playing in pg's sandbox. If the sand leaks it's his sand, and the only company this is mission critical to is YC.


Typing sudo won't save you, but using a higher-level interface will. Everyone I've ever known to change something in the database by hand, everyone at all, even on a hobby project that they know like the back of their hand, has screwed it up sooner or later. At some point the pain tells you you should stop doing that, and you create an admin tool that lets you do what you need to repeatably and safely.


I've never screwed it up on a live database, but I do take about 5 mins first: reviewing the keys, the types, whether or not something can be null, and checking whether critical columns contain duplicate values with something like

    select column_name, count(*) from table_name group by column_name having count(*) > 1;

To make sure that there isn't an underlying uniqueness assumption.

Sure I could do it in 10 seconds and save myself 290 seconds (a 97% savings!) but then one day I'd have to scramble like crazy in the middle of the night trying to figure out what I screwed up for hours on end.

I'm not saying don't build an admin tool, obviously those are needed for things like banning users, but just get in there and carefully fix the data if something is wrong.


This. Back in the day when I was in more of an analyst role, I ended up /having/ to hack on the live DB frequently (reasons for this were myriad).

1. Always, always make a backup just before the hack.

2. Write a small set of queries like 3pt14159's to check uniqueness and other pertinent properties.

3. Write a SELECT query to show the data you are going to change.

4. Borrow the WHERE clause from 3, and write your UPDATE statement.

5. Run 4, and then run 3 again to see that you successfully fixed it.

6. When it goes wrong, restore the backup from 1 :0
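
Steps 3 to 6 can be sketched roughly as follows, using Python's built-in `sqlite3` module and a throwaway in-memory database. The `line_items` table, the `careful_update` helper and its parameters are all assumptions for illustration, and a transaction rollback stands in for restoring the backup from step 1.

```python
import sqlite3

# Throwaway in-memory database standing in for the live one (an assumption
# for illustration; table and column names are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE line_items (id INTEGER PRIMARY KEY, quantity INTEGER)")
conn.executemany("INSERT INTO line_items VALUES (?, ?)", [(1, 2), (2, 0), (3, -1)])
conn.commit()


def careful_update(conn, where, params, new_quantity, expected_rows):
    """Preview rows with a SELECT, reuse the same WHERE for the UPDATE,
    and roll back unless exactly expected_rows rows changed."""
    # `where` is trusted, hand-written SQL here; never build it from user input.
    preview = conn.execute(
        f"SELECT id, quantity FROM line_items WHERE {where}", params
    ).fetchall()
    cursor = conn.execute(
        f"UPDATE line_items SET quantity = ? WHERE {where}",
        (new_quantity, *params),
    )
    if cursor.rowcount != expected_rows or len(preview) != expected_rows:
        conn.rollback()  # something matched that we did not expect
        return False
    conn.commit()
    return True
```

Calling `careful_update(conn, "quantity < ?", (1,), 1, expected_rows=2)` commits, because exactly the two bad rows match. Asking for one row when two match rolls the change back and leaves the table untouched.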


Don't you have a dev db somewhere that you can replicate the live db to? Time spent setting that up will be more than repaid by the time and stress saved when you have to do a quick fix - you can simply run your changes, check it all works on your replicated site, and then make the changes on your live db (preferably with some sort of migration tool which applies the same sql and backs up first). If you have a regular backup process you could tie into that to populate the dev database.

Even if you can't replicate the entire live db, if you can automate backup, deployment of changes and test first elsewhere it makes the entire process far less fraught.


I'm going to add step 5b - save what SQL you executed (against what server, and for what reason), ideally in source control, as an audit trail.

Otherwise, I end up having this conversation (which actually happened):

Him: <Big Client> is having troubles! Features X, Y, and Z aren't working!

Me: Hmm, has anything changed? It was all OK on Friday.

Him: No, nothing's changed.

Me: Really?

Him: Well I ran a bunch of scripts on Saturday while I was visiting them.

Me: OK, so what exactly did you run?

Him: Just a bunch of scripts.


As a tip: also back up your staging database, and take all backups with something like Rsnapshot or maybe even a version control system, something which does point-in-time backups.

I learnt this after I inherited a project which had been written by some Romanians and it was pretty horrible. There was no MVC framework, it was a hacked together mess.

Somehow the live site started using the staging database instead of the production database, both were on the same server. Every time we (the devs) pushed to staging a script would grab the latest version of the live database and overwrite (drop tables) the staging database. The assumption being that the staging database is a bit like a demo server, changes made to it are temporary and just for testing, but that it should look as similar to the main website (but updated) as possible. The production database was backed up in about 5 different ways, but the staging database wasn't backed up at all.

After about a week of vanishing books, books which authors had uploaded to the self-publishing portal with descriptions and other information, we realised what was wrong. Their files stayed but their accounts and book details were wiped.

In another epic fail on the same server I later moved the root folders by running the following as root (I'd probably have been stupid and run the same command if not as root, but I'd have put sudo in front of it):

    cd /home/<username>/public_html/public_html
    mv /* ../

I meant to run mv ./* (to move the files from the current directory into the one below, because they'd been copied across into the wrong folder). Needless to say, moving root folders such as /etc and especially /lib and /bin is a BAD idea. It is fixable, but that's another story.


6. Should be ROLLBACK, a life saver. Works in postgres.


Maybe I'm old school, but shouldn't this be done in a dev or acceptance environment?

I hack on the "live" DB every day, and by live I mean I sync this DB to another environment, try it out there, then run it on prod.


One of the things I prefer to do is to only write UPDATE statements that update a single row. For example instead of:

UPDATE line_items SET quantity = 1 WHERE quantity < 1;

I'd script the following updates:

UPDATE line_items SET quantity = 1 WHERE quantity < 1 AND id = 123;

For each of the individual rows that needed to be changed. Then I have a check that I'm really updating just the rows I expect, this is especially important to me where the UPDATE involves joins, as I find this is the trickiest to get right.


Is there any other way??? :)

I thought everyone did this - well... for small datasets, skip the backup and use a transaction. Roll back if your step 5 failed and try again.


This is pretty much exactly how I do it.

I still sometimes get that sinking feeling in the stomach that I have screwed something up, usually just after I hit the 'execute' button. And I really don't want to have to take the site down to run the restoration.


This reminds me of a feature that I wish that database systems supported: Make it impossible to execute DELETE or UPDATE statements without a WHERE clause.
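
For what it's worth, MySQL's command-line client has a `--safe-updates` option along these lines. Where the database offers nothing, a wrapper in your own tooling can approximate it. The following is a deliberately naive sketch; `check_statement` is a made-up name, and the regex approach is an assumption that will not survive clever SQL.

```python
import re

# Deliberately naive: matches the keyword WHERE anywhere in the statement,
# so it can be fooled by string literals, comments, or subqueries.
_WHERE = re.compile(r"\bwhere\b", re.IGNORECASE)
_RISKY = re.compile(r"^\s*(update|delete)\b", re.IGNORECASE)


def check_statement(sql):
    """Raise ValueError for an UPDATE or DELETE that has no WHERE clause."""
    if _RISKY.match(sql) and not _WHERE.search(sql):
        raise ValueError("UPDATE/DELETE without a WHERE clause: " + sql.strip())
    return sql
```

The idea is to pass every hand-typed statement through `check_statement` before it reaches the connection, so that `DELETE FROM line_items` fails loudly while `DELETE FROM line_items WHERE id = 123` goes through.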


> I eventually realised that typing sudo every time he touched a box was no defence against doing the wrong thing.

IIRC, sudo logs all commands to syslog, which might come in handy. Yes, root commands will be logged by bash to .bash_history, but there are limits on the number of commands logged, and it's unclear what happens if you are logged in multiple times to the same account, etc.

Anyway, that's why I like sudo.


Beyond the logging, which I love, I use it for the differentiation of states. I'm just a little more attentive when I type "sudo" before something.

At work, a relatively young engineer accidentally typed a command meant for a test database server into a production window. There was a big rush to restore from backups, and there was a small amount of data loss.

One thing that came out of the retrospective, requested by the engineer in question, was the production hat. Before you opened up a connection to a production machine, you had to put on the large pirate hat. You could only put it back when you had closed the connections. I didn't really need it, but it was a great way for people to learn the necessary caution.

It also ended up being a nice exclusive lock on futzing with production, and seeing it in use led to some good discussions that otherwise might not have happened. But the main thing was developing a strong differentiation of states in everybody's heads.


Another cheap solution is colorizing the prompt red for dangerous consoles and green for dev machines. This makes it very easy to notice when you selected the wrong terminal.


I use a different background color for the terminal in question; in this case, I use a dark red to signify a production system, and a dark blue to signify a development system. I find it quite useful. You can do this through Xterm profiles (Edit -> Profiles, Terminal -> Change Profile) in Linux and OS X, and I'm sure there's a way to do it in Windows / PuTTY.


I have tried the red / green console but never a pirate hat.

My son now definitely thinks work is like his school :-)


Plus security. With root login disabled a remote attacker won't have a known username to attack.


I find most of the time I'm using sudo I don't want to type it before every command and so I use sudo -i which pretty much negates the benefit of logging anything other than to tell that I was sudo at some point.


> It read like a High School essay - no life, no anime. ... I am not entirely sure those thoughts are joined up (I am procrasting like crazy)

Your procrasting like crazy has much anime.


s/anime/animus/ ? or is this a new usage of "anime"?


s/anime/anima - as in soul, vitality

(Not so much Jung's inner woman)

I think the sentence does read better if it is complaining there are not enough cyberpunk Japanese comics on my site though :-)


Here I was thinking you were referring to the Japanese meme "No ___, No Life!"


I'm not sure whether it's terrifying or relieving to realize that if all I dream of comes to pass and I achieve something akin to the legendary status of pg in the hacker community that I will still be susceptible to the inevitable facepalm moments that come with direct database access.

In any case I am thankful for the detailed explanation.


Some of the most spectacular airplane crashes are by the most experienced pilots.

If you've ever tried something new as a hobby you tend to be very careful. Once you gain confidence you take more chances and don't do what even a beginner might do.


Too bad we rarely get a postmortem on batshit insane production hackery that actually goes off without a hitch.


I would like to turn this into a poster or a t-shirt.


There must be a rare personality type that never experiences this kind of overconfidence. Perhaps a less glamorous cousin to the Buddhist beginner's mind?


There are certain classes of autistic people who are very good at always following the rules and finding people who are not. And in certain places they are exactly what you want.

http://www.nytimes.com/2012/12/02/magazine/the-autism-advant...


I find that if I am doing something "dangerous" (an example might be using power tools) I have to say to myself "be careful, this is dangerous" to avoid being on autopilot and making casual errors. Maybe a better example is the way you train yourself, after you've picked up a box the wrong way and pulled something, to remember each and every time to watch your specific movements.


But at some point, you will become complacent, and it will take a mistake to remind yourself again.

We've all done it. I shut down an NT4 production server because I was connected via remote desktop and clicked shutdown rather than log off. This was back in the day when there was no pop-up asking for reason you want to shut down and confirmation.

Luckily it was just our internal intranet server!


You were running NT4 Terminal Server Edition?


Yes, I think so. It was so long ago and it was my first programming role!


AFAIK that edition had logoff instead of shutdown on the Start menu for that reason. You can still access shutdown by hitting Ctrl-Alt-Del on the console or clicking Windows NT Security.


It was so long ago I have no idea. I defo vividly remember that I shut it down via the start menu, just one of many moments that stick out :)


I have to remind myself this every time I start up Sequel Pro now since in the last release they switched the command keys for Run Selected... and Run All...


I agree. I've been told that with motorcycle riding, the first 10K miles are the most dangerous. This is when you've gotten out of the newbie stage, but don't yet understand your own limits nor the bike's limits.


I thought it was a pretty cool error. I mean, you've got to not screw up on a lot of boring things before you can screw up this interestingly. Most failures are much more boring.


The amusing part is that no matter how legendary you become, restarting the server is always a good idea to solve problems. Software is rarely designed to run forever. Last week I had a moment of madness because a line of code remained buggy even after I debugged it. Turned out it was php's opcode cache that just needed a reset to get its wits back.


He's not legendary for his IT skills.


It was his IT skills that got him the big sale to Yahoo that got him the bucks to start YC. Not sure where this comment came from.


You really think yahoo bought viaweb because of pg's legendary ability to reboot servers and type "./configure && make && sudo make install"?


Now he is ;)


Low blow.


Great postmortem and good lessons to learn here:

* Don't manually modify database without a well-tested procedure and another pair of eyes

* Don't leave persistent problems (e.g. memory problems) uninvestigated so that you miss new problems with similar symptoms

* Don't push new code to production while operational problem is ongoing (unless it addresses the operational problem)

I'm pretty sure I've repeated this exact same sequence before with similar results...


* Don't push new code to production while operational problem is ongoing (unless it addresses the operational problem) ^^ absolutely!


* while you are displaying a tree keep track of the items you already displayed so you can detect a cycle


I think the assumption there was that it was safe, since the code disallowed this from happening, naturally.


Assumption is the mother of all screw ups.

Even if you think that the code that creates and modifies your data will not put it in some undesired state, the code that uses this data should assume that the data may be in all undesired states you can dream up and should do its best not to do something seriously bad when that happens (like landing in endless loop/recursion or executing possibly user provided strings).
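
As a rough sketch of that defensive stance for a comment tree: the renderer below remembers every item it has already emitted, so a loop in the data produces a short marker instead of an endless page. The `children` and `text` dicts and the function name are assumptions for illustration, not how HN stores or renders threads.

```python
def render_thread(children, root, text):
    """Render a comment tree as indented lines, skipping any item seen before."""
    lines = []
    seen = set()
    stack = [(root, 0)]
    while stack:  # explicit stack, so a deep thread cannot hit the recursion limit
        item, depth = stack.pop()
        if item in seen:
            lines.append("  " * depth + f"[item {item} repeated; skipping]")
            continue
        seen.add(item)
        lines.append("  " * depth + text[item])
        # Push in reverse so children come out in their stored order.
        for kid in reversed(children.get(item, [])):
            stack.append((kid, depth + 1))
    return lines
```

With `children = {1: [2], 2: [3], 3: [2]}` the loop between items 2 and 3 yields four lines and then stops. The cost is one set of ids per page view, which is small next to generating the page itself.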


use CHECK constraints to prevent invalid data patterns when possible.


Sadly, even that isn't enough.

In our production database, I used CHECK constraints religiously. Worked great.

Then one day, I was no longer able to commit ANY transactions to a particular table, even completely innocuous ones.

The problem? The database itself had violated its own CHECK constraint on a previous commit, but was enforcing it on all subsequent commits, causing them to fail. Brilliant.

Moral: not even CHECK constraints will save you.

----

P.s. This was a proprietary database, and when I reported the problem to the vendor (eventually, I figured out how to reproduce it), the vendor actually refunded our (expensive) support contract rather than fix the bug -- they couldn't figure out how to fix it despite having a small bug report that reproduced the problem.

In the end, I actually had to remove the CHECK constraint altogether. :(


> The problem? The database itself had violated its own CHECK constraint on a previous commit, but was enforcing it on all subsequent commits, causing them to fail. Brilliant.

I've never heard of that, and unless you're using a really buggy, broken database, it should not be possible.

> P.s. This was a proprietary database, and when I reported the problem to the vendor

well there you go. I think in practice, a simple CHECK constraint like the one we'd do here (literally, comment_id > parent_comment_id) is pretty easy to put one's faith into.
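
A minimal sketch of that constraint, using SQLite through Python's built-in `sqlite3` module. The table layout and the `try_reparent` helper are assumptions for illustration; only the `comment_id > parent_comment_id` rule comes from the comment above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        parent_comment_id INTEGER REFERENCES comments (comment_id),
        -- A reply must be newer than its parent, so no chain can loop back.
        CHECK (parent_comment_id IS NULL OR comment_id > parent_comment_id)
    )
    """
)
conn.executemany("INSERT INTO comments VALUES (?, ?)", [(1, None), (2, 1), (3, 2)])
conn.commit()


def try_reparent(conn, comment_id, new_parent_id):
    """Attempt the update; return False if the CHECK constraint rejects it."""
    try:
        conn.execute(
            "UPDATE comments SET parent_comment_id = ? WHERE comment_id = ?",
            (new_parent_id, comment_id),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()
        return False
```

Here `try_reparent(conn, 2, 3)` is rejected, since it would put comment 2 under its own child. The trade-off is that the rule also forbids moving an older comment under a newer one, which is sometimes a legitimate fix.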


That's hard to do when you are afraid of databases and just store everything in files.


This should serve as an example template for how to accurately and transparently explain to users what went wrong. No deflecting blame, no useless platitudes.

Credit to PG, RTM and the rest of the team for keeping the site's uptime as high as it is.


"No deflecting blame"

Who were they going to blame?


He could have blamed the new server. Or whatever distracted him. Or the user, for being dumb. I've seen people do all of those.

Or he could have just dodged the blame entirely.


Some people are just freaking difficult to work with. I've worked with people that wouldn't accept responsibility even after every other possible cause was ruled out. I've even gotten this reply: "well, you must have been unlucky to get the faulty e-mail, because it seems to work most of the time". Yeah, because that's how programming works: cowboy coding and hoping for the best.

This guy actually called bugfixes "optimizations": "hey, X the feedback widget on the front page isn't working", "oh, yeah, I haven't worked on that because that's code that needs to be 'optimized', so it's low on my issues list". Ugh.

I've learned my lesson now. In fact, that's an incredible lesson for a startup founder: never, ever, hire someone who dodges a question on an interview. And the first time they avoid taking responsibility for something that was clearly their fault, fire them. The last thing you want is someone who'll blame everyone and anything else for their issues. It's a great way to kill morale and create rifts in a small team.


> And the first time they avoid taking responsibility for something that was clearly their fault, fire them.

I guess it's still difficult for a lot of people to acknowledge their own mistakes, maybe because they're afraid of getting fired for that (acknowledging the mistake), which in the case of startups/small companies happens very rarely.

From my own experience of working at startups for my entire professional career as a programmer (7 and a half years) I can tell you that the first step when noticing you f.cked something up is to take immediate responsibility and then ask yourself "how can I/we fix this?" (you might need the help of other people to fix your mistake). After you've fixed the issue the question should be "how can we make sure that this doesn't happen again?". With that solved, I'd say nobody cares anymore whose fault it was to begin with; there's always other more important stuff to do.

I agree that maybe at larger companies this kind of thing might happen exactly the opposite way, i.e. you can get fired for making a mistake and nobody really cares to fix other people's stuff, because their next paycheck/financial well-being does not depend on that (or so they think).


Exactly! When everyone is on the same boat the priority is fixing the stuff, then wondering whether attributing responsibility is important (most of the time it isn't. Who cares who fucked up the e-mail template, as long as it's fixed next time it runs.)

I think in this guy's case the causes for his reluctance (or rather, incapability) to accept his own mistakes had much deeper roots. He was literally the most self-centered person I've ever met, to the point that he wouldn't accept anyone's opinion on anything. He went out of his way to find doctors that'd go with his suggestions and run all kinds of tests on him to determine why he had a blood pressure problem, when he was clearly way overweight and had the unhealthiest diet I've ever seen. He'd dress up in shorts and t-shirts during the worst days of winter, and then take a niacin pill to force a capillary rush so his hands wouldn't feel cold (?)

Basically, he just thought the world had to bend to his will. Why use common sense, when you can just say "fuck it" and find a workaround that fits your mindset. Of course you can't expect someone like that to 'accept' his own shortcomings.

The scariest part of all this is he tried, for a while, to become a cop. Yep, imagine that: a 240lbs armed prick, completely unable to reason, forcing his way on everyone. I shudder at the thought.


> imagine that: a 240lbs armed prick, completely unable to reason, forcing his way on everyone. I shudder at the thought.

Where are you from where that's something you have to imagine, because I want to move there.


pg has an essay where he says the smartest people he knows are always willing to take blame or admit they don't know the answer to a question.


Sequoia Capital or Andreessen Horowitz


The user who replied incorrectly?


I don't know, it's a lot easier to be transparent when the stakes are so low. Most service providers have a real incentive to not put out quotes that can later be used against them, which tends to make explanations very technical or deflecting.


"But then I decided to just fix it for him by doing some surgery in the repl."

I've always found it's a good idea to not deviate. Whether it be running, parking or anything else once you deviate from some regular behavior you run into potential problems that you hadn't anticipated.

"For some reason I didn't check the comments after the surgery to see if they were in the right place. "

More or less my point. If this wasn't a deviation from normal behavior you would have "checked the comments after the surgery" because it would have either become habit or the sheer number of times you tried a fix resulting in an error would have made that more likely to occur.


> I've always found it's a good idea to not deviate.

Aren't you assuming "surgery in repl" is a deviation? What if it's normal course of action for him?

> More or less my point. If this wasn't a deviation from normal behavior you would have "checked the comments after the surgery" because it would have either become habit or the sheer number of times you tried a fix resulting in an error would have made that more likely to occur.

How about the opposite scenario? He has done it so many times with desired results, that he didn't bother checking?


> Aren't you assuming "surgery in repl" is a deviation? What if it's normal course of action for him?

This is a big difference between engineering and hacking. An engineer would never regularly do something so dangerous.

But I suspect pg isn't an engineer when he works on HN, I suspect he is a hacker, and just does whatever he wants to, whenever he wants. Which is his prerogative.


> I've always found it's a good idea to not deviate.

> you run into potential problems that you hadn't anticipated.

The second statement is no reason to live by the first. In fact, I think you'd be doing yourself a disservice by staying so comfortable. Being comfortable with the unanticipated, however, is a powerful quality to have.


That's fine, but try to become comfortable with the unanticipated on a test server, not the production server.


If you don't deviate from what you usually do you don't learn.

Obviously don't deviate from routine (or rather prescribed procedure) when you are running a nuclear power plant or doing airplane maintenance. But when tinkering with a site that gives you no money and won't cost any lives, you can loosen up a bit.


Why do "self posts" like this show up in the same light gray as posts with negative vote counts? My eyes aren't great and I find it hard to read


The rationale for this is that if you need to post a long text post, it should be in the form of a blog post instead. I agree with you that it's not really adapted for a meta post.


Maybe post color is based on some get_text_post_color method that applies to self posts and comments, where the color depends on comment karma. Given that self posts like the OP are votable as if they were a normal link post, their comment karma value is probably 0.


I don’t know pg’s reasons for making self-posts light gray, but you can fix problems like that with the bookmarklet Zap Colors: https://www.squarefree.com/bookmarklets/zap.html


I'm using "Hacker News Enhancement Suite" Chrome extension - it fixes multiple problems, including this one.


Disclaimer: Hindsight is 20/20, and stuff.

If reverting the code didn't fix it and reverting the server didn't fix it, incorrect data is the most likely culprit (I am not claiming this should have outright occurred to you; just thinking out loud). I take it you introduced non-terminating recursion by making a thread its own parent, and you made the change on disk.

But this analysis is the last thing that comes to mind when you have already introduced 2 new variables the same day: new code, new server. And an old, recurring variable (GCing too much) is in play as well.


So what do you do to avoid this in the future? Do you stop doing surgery in the repl, or do you do the surgery with functions that check for cycles from now on?


This reminds me of the countless conversations I've had with people after a crisis. What can we do to prevent this from happening again? What process can we put in place? What restrictions need to be tightened up?

And that's how processes are born.


> And that's how processes are born.

Not necessarily. Processes are implemented by people, so they can break at any time.

The correct solution is more code, or less bad code.


And how do you get less bad code? Magic dust or process? My money would be on the latter, as in http://www.fastcompany.com/28121/they-write-right-stuff.


A newsfeed with ranking is a bit far off a space shuttle.


> So what do you do to avoid this in the future?

It's HN... there's no SLA, there's no postmortems, there's no doing things better in the future. pg just runs this site out of the good of his heart, we should be lucky the volunteers run it for us at all.


"pg just runs this site out of the good of his heart"

Don't think it's a "good of his heart" situation. HN provides a benefit to YC and YC companies and attracts people to YC. As another example, Fred Wilson has a very popular blog, AVC, and has said many times that he considers it "his secret weapon" (or something like that) because of the value it provides over his competition.


> there's no postmortems

1. Press [Home] key.

2. Read postmortem.

3. ???


> It's HN... there's no SLA, there's no postmortems

I didn't mean to imply that there were. I was just curious.

> there's no doing things better in the future. pg just runs this site out of the good of his heart, we should be lucky the volunteers run it for us at all.

Since there are multiple volunteers, I think that the site always feels important to at least one of them. I imagine that some of them have gone more than a week without giving a shit about HN, but not all of them at once. So I think there is doing things better in the future. In fact, HN keeps getting improvements behind the scenes, to keep it running, keep it interesting, and keep it from getting overrun with trolls.


> pg just runs this site out of the good of his heart

Hilarious. I would have believed you if you appended "and his wallet"


One core. One HD. Bandwidth is trivial with no images. How much do you think this site costs to run?


He means that PG runs the site because it makes business sense, not out of charity.


You didn't understand what I said. I was implying that this site is a money-maker for pg, not that it costs him money.


His time. Maybe the core and HD are Very Nice ones, too, of course.


How much do you think the time of YC partners is worth?


I don't think parent did mean that he had a "right" to expect some level of quality or anything.

It's just that we, as programmers, tend to take measures so that silly bugs do not happen anymore or that, at least, we leave big clues as to what went wrong.

In a project I had a similar issue: I was wrapping lists inside immutable lists but, due to a silly bug, I kept wrapping immutable lists inside immutable lists at every save made. So saved files would grow bigger and bigger.

And I did fix the bug and also added a big fat warning log in case too many nested lists were detected.

pg might just as well have now added something preventing infinite recursion inside the comment tree or some WARN logging telling when a generated page is getting too big, etc.

I'd still find it very interesting to know what pg did, if anything, to dodge / minimize / make it easier to determine if such an issue happens in the future.


"We'll do it live!"


"Hey, Hold my beer and watch this!"


epic


This is a particularly endearing piece of "hacker news". It's so easy to relate to.


Are you saying you manually modify the database? Like, shifting around things by id instead of just making admin buttons next to posts?


I think I get what you're hinting at.

Ok, so this is Hacker News, it's in the name, and most of us are aware that HN is also a research/hobby project. It's not made to be a rock-stable enterprise system doing bank transactions or what not, so I think what pg did was perfectly excusable. People make mistakes. Nobody will die without HN for a day or two, and it won't affect the site's popularity one bit.


No that's right, but I worry about apache2 being down for potentially one or two users or bots that visit/crawl my website during a one-minute reboot. Meanwhile the big boys are down for 16 hours because they do things that any other person would have gotten a decent scolding for. Just look at the points per hour this thread is getting, if I had posted this about my website on my website people would have said I was stupid.

You are right though, making mistakes is human as they say, and nobody dies because of this. In fact, less popularity might be good for the site's content quality. I'm just surprised by how much they care about thousands of hourly users; that's what I would dream of having.


I'd say this is a pretty important lesson: you don't need flawless technology and zero downtime to be popular and/or profitable. You need content worth viewing. People are more than willing to put up with technical errors if it's something they want/need.

Focus on providing people what they want/need, and don't worry so much about having flawless technology until you can employ a horde of PhDs.


I would say that's an observation you just made there.


Patrick McKenzie had a great horror story on his blog a couple of years back. He runs a service that provides appointment reminders to businesses' clients (e.g. "Don't forget, you have an appointment to get your hair colored at Best Little Hair House tomorrow at 3"). Long story short, an attempt to manually correct a hangup in the live system resulted in his product spamming his customers' clients (that's right — not just his customers, but their customers) with up to 40 phone calls back-to-back.

So, how many customers do you think he lost because of this? The answer is two, and one of them signed back up because they were impressed by the great job he did in handling the fiasco.

Moral of the story: As long as you really are making your best effort, you might be surprised how willing people are to deal with human error. Yes, they might be mad, but a mistake is (usually) not the end of the world.


>Just look at the points per hour this thread is getting, if I had posted this about my website on my website people would have said I was stupid.

For what it's worth, I upvoted this thread specifically because we've all done something this stupid (or worse) :)


HN runs on plain files. He wasn't modifying a database, but calling functions (I believe) in the repl to change the parent id of the thread.

But that apart, even if there were an admin button to change the parent id of a thread, he would still have made the same mistake.

Unless the code in question was checking for loops. In that case, repl would have worked the same.


Glad that you brought this up. If you have more knowledge regarding this, can you please explain how exactly the posts & nested comments are stored directly using flat files? How are concurrency issues handled?


I sort of meant that you shouldn't modify things like that directly. Be it a filesystem, database, or any other place that makes it possible to mess things up to bring a rather strong server down.


I see where you are coming from. But I am saying this didn't happen because he did things live. This happened because he entered an incorrect id, making a thread its own parent (or grandparent; doesn't matter).

This is the kind of mistake one would make even when writing proper migrations. That he was doing things live isn't the issue; neither is the incorrect id. The issue is that the code doesn't check for loops.


I am but an egg, I have two questions.

One, if the data were held in a database, should a change like this be captured in the database logs? I am seeing more and more situations where I want these, I notice that they are by default turned off for mysql and wonder if this reflects a de facto judgment that logging slows performance more than is usually worthwhile.

Two, if the data were kept in a database, wouldn't something like this be prevented by a constraint preventing a comment from making itself an ancestor? But I suppose there is a slight performance hit in checking such constraints, and the case arises so rarely that this hit isn't generally worthwhile.


> I notice that they are by default turned off for mysql and wonder if this reflects a de facto judgment that logging slows performance more than is usually worthwhile.

I think it's more that your application is probably doing the logging already (most of the frameworks do). If you really need it, turn it on yourself.

> Two, if the data were kept in a database, wouldn't something like this be prevented by a constraint preventing a comment from making itself an ancestor?

Copy pasting the table from another comment.

    create table post (id int primary key, parent_id int references post(id), child_id int references post(id), created_at timestamp);
There isn't a simple check constraint you can place to ensure a parent's, or a grand-parent's, or a grand-grand-parent's parent_id isn't child.id. You will have to write a trigger.

This isn't really a big problem to solve. pg simply overlooked this problem. Had he not, he would have checked child.created_at > parent.created_at in his mutator method. So, when you do a post.parent = some_post(assuming mutator is parent=; replace it with post.setParent or (send post set-parent some-post) or whatever), it checks if post.created_at > some_post.created_at, and then assigns post.parent_id = some_post.id
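A rough Python sketch of the mutator check described above. The `Post` class, the `set_parent` name, and the integer timestamps are all hypothetical, made up for illustration; this is not HN's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Post:
    id: int
    created_at: int  # hypothetical: any strictly increasing timestamp
    parent_id: Optional[int] = None

    def set_parent(self, new_parent: "Post") -> None:
        # A reply must be newer than the post it hangs under. If every link
        # obeys this, timestamps strictly decrease toward the root, so a
        # chain of parents can never loop back on itself.
        if not self.created_at > new_parent.created_at:
            raise ValueError(
                f"post {self.id} is not newer than proposed parent {new_parent.id}"
            )
        self.parent_id = new_parent.id
```

The check only works if surgery goes through `set_parent`; assigning `parent_id` directly bypasses it.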


Databases, at least the SQL kind, really aren't good at dealing with hierarchical data, and I don't know how you'd even begin to express that kind of constraint. I don't think a traditional database is the answer here. (If it were me, once I'd done it more than twice I'd write a "move thread" admin tool in the UI, and after I screwed it up like this I'd have a place to add such a check to).


If you were using some kind of representation for Nested Sets -- left-to-right depth-first numbering, or a human-readable id.id.id chain -- then it's really easy to write a constraint for that: parent left < myleft, right > myright, or dotted_id.split('.').filter{|first, rest| return false if rest.contains first} (yeah, yeah, that second pseudocode would be unrealistically PITA for some DBs).
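For what it's worth, the dotted-id variant might look something like this in Python. Both function names are hypothetical, and it assumes each comment stores its full ancestor chain as a string like "1.5.9".

```python
def path_is_acyclic(dotted_id: str) -> bool:
    """True if no id repeats along a dotted ancestor chain such as '1.5.9'."""
    ids = dotted_id.split(".")
    return len(ids) == len(set(ids))


def is_descendant(path: str, ancestor_path: str) -> bool:
    """True if `path` lies strictly below `ancestor_path` in the tree."""
    # The trailing dot keeps '1.50' from being mistaken for a child of '1.5'.
    return path.startswith(ancestor_path + ".")
```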

More generally:

I'm not a big SQL wonk anymore, but I find a lot of people have the intuition that relational databases are ill-suited for trees.

An intuition that is much closer to the truth is that almost all databases can handle trees pretty well, because there's still an unambiguous concept of ordering and containment, and you can usually arrange things so as to do range/ancestor/inclusion queries efficiently.

It's graphs with loops/without unambiguous concept of ordering/containment that are really hard.


Found this: The excellent Postgres documentation includes an SQL graph search with two different ways of graph cycle checking, here: http://www.postgresql.org/docs/9.0/static/queries-with.html

One way involves accumulating an array of nodes already visited as the tree gets walked, checking each node as-visited for membership in the array-to-date.

The other method, a bit more of a hack, is just adding a LIMIT clause.

I think the 'WITH' clause is a great addition to the SQL standard, very much worth learning the weirdness of its syntax and its optional 'RECURSIVE' term (which, as the Postgres documentation points out, isn't really recursion, it's iteration).
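Here is a small sketch of the first method (accumulate the nodes visited so far), run against SQLite from Python so it is self-contained. SQLite has no array type, so a '/'-delimited path string stands in for the array used in the Postgres docs; the `post` table, its sample rows, and the `has_ancestor_cycle` helper are assumptions for illustration.

```python
import sqlite3

# Assumed schema: one row per post, parent_id is NULL for top-level posts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, parent_id INTEGER)")
conn.executemany(
    "INSERT INTO post VALUES (?, ?)",
    [(1, 3), (2, 1), (3, 2),   # 1 -> 3 -> 2 -> 1: a loop
     (4, None), (5, 4)],       # 5 -> 4: a healthy chain
)

# Walk upward from one post, recording every id seen in `path`.
# A row is flagged as soon as its id already appears in the path so far,
# and flagged rows are not expanded further, so the walk terminates.
ANCESTOR_WALK = """
WITH RECURSIVE walk(id, parent_id, path, is_cycle) AS (
    SELECT id, parent_id, '/' || id || '/', 0
    FROM post WHERE id = ?
  UNION ALL
    SELECT p.id, p.parent_id, w.path || p.id || '/',
           instr(w.path, '/' || p.id || '/') > 0
    FROM post AS p JOIN walk AS w ON p.id = w.parent_id
    WHERE NOT w.is_cycle
)
SELECT MAX(is_cycle) FROM walk
"""


def has_ancestor_cycle(conn: sqlite3.Connection, start_id: int) -> bool:
    row = conn.execute(ANCESTOR_WALK, (start_id,)).fetchone()
    return bool(row[0])
```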


I think if you want the family tree, you can write a self-referential recursive query (assuming the post table is self-referential, as it should be).

But in this case, writing a before insert/update trigger which ensures some_post.created_at < post.created_at before setting post.parent_id = some_post.id will do the trick.


Yes, I was trying to make posts editable on the HN instance I run, so I got clever and started messing with the files in emacs. Then I learned that the HN code does not like files in the story directory with ~ on the end of their name (emacs backup files), oops ;).


Sometimes that's easier (albeit more dangerous, as we just saw).


"Are you saying you manually modify the database?"

Oh manually modifying production database on the fly ain't unheard of.

However it's still not "very Chuck Norris" on a scale of Chuck Norrisness compared to the modification of a running app directly in the REPL. I mean: it doesn't matter if you manually modify the DB itself or not when you directly modify the app from the REPL itself (the app being anyway "in charge" of the DB).

Sure, modifying manually the production DB might be an issue to some. But I can guarantee you that it's the last of your worries when you're actually modifying production code directly from the REPL ; )


Do you have munin monitoring on the production HN server?

That would really make situations like this easier to debug. First, it can pinpoint exactly when something started happening, which in this situation might have helped you realize the problem was caused by your change. Secondly, in this specific situation it probably would have been easier to differentiate a situation where you are running low on memory vs this completely different situation.

As somebody who spent a lot of time professionally debugging large software systems when they were misbehaving (as a Google SRE), I can tell you that looking at graphs of many key metrics (disk IO, CPU, memory, then application specific things) was always the place to start when debugging a situation, because you can learn so many things right away. When did it start? Was it a slow buildup or an immediate thing? What is the general problem (Memory?, Disk IO?, CPU?, none of the above?)? Has a similar pattern happened in the past?

Then you can start to get fancy and plot things like "messages/minute" or something and then it becomes easy to see when issues are affecting the site performance and when they aren't.


That and something like the Smalltalk Change Log would have made this a no-brainer debug. (Yes, every REPL action in Smalltalk got logged by the same mechanism that logged every code change.) Such mechanisms aren't trivial, but they're not rocket science either, and they have tremendous ROI.


I wonder what exactly did distract you :) When I do surgery on a production server, I triple-check making sure everything works properly.

I have two assumptions: 1. HN has a low priority in the overall scheme of things, 2. Self-confidence overflow :)


Happens to a lot of us. Great reason to always write tested cleanup scripts for this stuff instead of editing directly on the server. The only time I brought down my product last year was from a similar screwup, I was removing users by hand and somehow managed to end up with a 0 in my list of user ids, thus deleting the anonymous user, and causing havoc to my server, which took a long time to track down.


Thanks for the detailed explanation.

It sounds like everything was done to fix the problem except try to figure out what the problem actually was. Why not use tools to see what the program is doing, form a hypothesis, gather data to confirm or reject the hypothesis, repeat until cause found, and then take corrective action that by this point you have high confidence will work?

I realize HN is more of a side project than a production service, but the goal is the same in both cases: to restore service quickly so you can move on to other things. It feels like a more rigorous approach would allow restoring service much faster than randomly guessing about what could be wrong and applying (costly) corrective action to see if it helps.

Besides that, in many cases (including this one), you cannot randomly guess the appropriate corrective action without finding the root cause.


I use assertions to protect against things like this.

I liberally sprinkle my code with assertions (CS theory calls them pre-conditions and post-conditions, iirc) to crash early if the system is in an invalid state.

One of my pet peeves is that few programmers seem to love assertions like I do. Would love to see comments on this.


What assertion would you have used in this case? For every comment you'd have to iterate through all its parents to check if there is a cycle, which seems pretty inefficient to do for something that should never happen (there are other ways that you could check for this problem as you go, but the only other ways that I can think of require holding extra state just in order to perform the assertion).

I'm for assertions when they are simple and don't cost much (especially during development), but it's not feasible to check every condition that should not happen.


You could assert a limit on depth, perhaps. Then the cycle would still exist but after X number of comments, the rendering ends.
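Something like this, as a sketch. The `children` mapping (comment id to list of reply ids) and the default cap of 100 are assumptions; the real renderer presumably looks nothing like it.

```python
from typing import Dict, List, Tuple


def render_thread(
    root: int, children: Dict[int, List[int]], max_depth: int = 100
) -> List[Tuple[int, int]]:
    """Return (depth, comment_id) pairs in display order, never deeper than max_depth."""
    out: List[Tuple[int, int]] = []
    stack = [(root, 0)]
    while stack:
        cid, depth = stack.pop()
        out.append((depth, cid))
        if depth >= max_depth:
            continue  # cap reached: emit this comment but do not descend
        # Push replies in reverse so they pop in their original order.
        for kid in reversed(children.get(cid, [])):
            stack.append((kid, depth + 1))
    return out
```

A looped thread still renders its comments over and over until the cap, but the page is finite.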


This is a reasonable solution. While it will (almost) never provide the correct result (it might print out a cycle of comments until X is reached, or it might cut off a very long but legitimate comment thread), it would provide a reasonable guarantee on this sort of problem not generating infinite pages.


At the risk of being accused of flame-baiting, I'd say it's the engineering solution rather than the mathematical one.. ;-)

For some reason I tend to be a fan of the "stick it in a secure box" rather than "get it right in the first place" approach..


Typically, if you're operating upon a particular comment, you've gotten there by traversing to it from the parent. Ensuring that traversals don't encounter cycles is easy: keep hold of a hashtable (or a set) of comment ids as you traverse. As the traversal encounters a comment, its id is added to the hash, and as you complete traversal of each comment, the id is removed. If you encounter an id that's already in the set, the assertion has failed; better yet, log the condition and then cease the traversal. That way everything keeps running and the error is visible in the logs.

If the code is organized (as it should be) such that all functions which require traversal of hierarchical comments pull this from a single function, then the hash check only need be applied in that one place in the code, where it need not be visible anywhere else.
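A minimal sketch of that single traversal function in Python. The `children` mapping (comment id to reply ids), the logger name, and `walk_thread` itself are hypothetical.

```python
import logging
from typing import Dict, Iterator, List, Optional, Set

log = logging.getLogger("comments")


def walk_thread(
    cid: int,
    children: Dict[int, List[int]],
    on_path: Optional[Set[int]] = None,
) -> Iterator[int]:
    """Yield comment ids depth-first, refusing to follow a link back up the current path."""
    if on_path is None:
        on_path = set()
    if cid in on_path:
        # Cycle: log it and stop descending instead of recursing forever.
        log.error("cycle in comment tree at id %s; not descending", cid)
        return
    on_path.add(cid)          # entering this comment
    yield cid
    for kid in children.get(cid, []):
        yield from walk_thread(kid, children, on_path)
    on_path.discard(cid)      # finished this comment's subtree
```

Everything that renders or counts comments would iterate over `walk_thread`, so the check lives in one place.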


>iterate through all its parents to check if there is a cycle, which seems pretty inefficient to do for something that should never happen

The number of parents is almost always under 3 or 4 and never over 100. Writes occur a few times a second at peak. You are prematurely optimizing.
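The write-time check is only a few lines. A sketch, assuming a hypothetical `parent_of` mapping from comment id to parent id (None at the top level):

```python
from typing import Dict, Optional


def would_create_cycle(
    post_id: int, new_parent_id: int, parent_of: Dict[int, Optional[int]]
) -> bool:
    """Walk up from the proposed parent; a loop would form if we reach post_id."""
    seen = set()
    cur: Optional[int] = new_parent_id
    while cur is not None:
        if cur == post_id:
            return True   # post_id is an ancestor of its proposed parent
        if cur in seen:
            return True   # the stored data already loops: refuse the write
        seen.add(cur)
        cur = parent_of.get(cur)
    return False
```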


The kind of assertion he needed though, could only be ensured by the database, not application code (my impression).


Agreed, infinite loops are a little hard to protect using asserts.

When I hit the first infinite loop bug on a code path, I frequently add code to assert that the number of calls is less than $A_LARGE_NUMBER to catch future occurrences of the same root cause.


I dimly remember a language that just hard-limited loops. I thought it was John Pane's HANDS system, but I can't seem to find a reference in the thesis...can anybody refresh my memory?

http://www.cs.cmu.edu/~pane/research.html

http://www.cs.cmu.edu/~pane/thesis/

Pretty cool work regardless, I really like the way it deals with aggregates, for example.


This is similar to the "while with timeout" that is common in embedded code (of course, watchdogs are better...)


> Agreed, infinite loops are a little hard to protect using asserts.

    assert(is_tree(comment_graph))
Typically, a composite entity (like an "item" on HN which has many "comments") will define invariants to ensure data integrity. In this case, the invariant is that an "item"'s comments form a tree.

The database layer often contains this logic, but it depends on how you're building your application; NoSQL backends for example typically must put validation in the application layer. Since HN just uses files, a well-developed application layer should be riddled with invariants like this.
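One possible `is_tree`, sketched in Python under the assumption that the comment graph is a mapping from each id to its list of child ids, with a known root:

```python
from typing import Dict, List


def is_tree(root: int, children: Dict[int, List[int]]) -> bool:
    """True if every node reachable from root is reached exactly once."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            return False  # reached twice: a cycle or a comment with two parents
        seen.add(node)
        stack.extend(children.get(node, []))
    return True
```

It only examines what is reachable from `root`, so it is cheap enough to run per item after any edit to that item's comments.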


The kind of assertion he needed could not be ensured by the database. The kind of assertion he needed was there are no cycles in the graph. How would you ensure that in a database?

Also, HN uses flat files, not a database.


A constraint on the parent-child link table "Child creation time stamp > Parent creation timestamp" would do it.

Might not be a bad idea, if the site were to have the two requirements "maintenance must be done on the live site from a repl" and "5 nines availability".


How are you modelling your data? I think this should be a self reference.

    create table post (id int primary key, parent_id int references post(id), child_id int references post(id), created_at timestamp);
How will you place the check constraint? You only have parent_id and child_id, not parent and child entities. You will have to write a trigger.

I am not saying this can't or shouldn't be done. I am saying a db won't directly solve it.

However, your example will work perfectly for enforcing constraints in the code via the mutator, which can compare child and parent timestamps, provided pg was doing it via a mutator and not directly changing the ids.


Following http://stackoverflow.com/questions/3438066/check-constraint-..., and assuming that IDs get doled out in increasing order:

    create table post (
      id int primary key,
      parent_id int references post(id),
      CONSTRAINT foo CHECK (id > parent_id)
    );
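To see the constraint reject the bad surgery, here is a quick sketch using Python's built-in sqlite3 module. The constraint name, the sample rows, and the `try_reparent` helper are made up for the demo; other databases may report the violation differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE post (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES post(id),
        CONSTRAINT child_after_parent CHECK (id > parent_id)
    )
""")
# A NULL parent_id (top-level post) passes the CHECK, since NULL is not false.
conn.executemany("INSERT INTO post VALUES (?, ?)", [(1, None), (2, 1), (3, 2)])


def try_reparent(post_id: int, new_parent_id: int) -> bool:
    """Attempt the update; return False if the CHECK constraint rejects it."""
    try:
        conn.execute(
            "UPDATE post SET parent_id = ? WHERE id = ?", (new_parent_id, post_id)
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

Moving post 3 under post 1 is allowed; making post 1 a child of post 3 (the grandchild loop) is refused.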


I was thinking something like (supposing the comments were stored as "closure tables" like Karwin suggests):

    CREATE TABLE comment_tree (
      ancestor_id int NOT NULL REFERENCES comments(id),
      descendant_id int NOT NULL REFERENCES comments(id),
      CHECK (ancestor_id <> descendant_id)
    );
but I'm probably overlooking something. (I'm aware that HN uses flat files, I was just making a counter-point to the "simple assert" solution...)


That will prevent a child being its own parent. It won't work for more than one level, i.e. a post being its own grandchild. Assume (post_id, parent_id) sequence: (1, 3) -> (2, 1) -> (3, 2).


You can assert that "post_id > parent_id", assuming comments are always created subsequent to the creation of their parents (as is the case here) and that integer identifiers are always increasing (otherwise use timestamps). (1, 3) above would indicate an invalid case (not necessarily a cycle, but a precondition for one).


Please note that the "Closure Table solution involves storing all paths through the tree, not just those with a direct parent-child relationship."
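Because every ancestor/descendant pair is stored, the cycle check before a move becomes a single lookup. A rough in-memory sketch in Python, with a set of pairs standing in for the table; `build_closure`, `can_reparent`, and the `parent_of` mapping are hypothetical names, and no self-rows are stored, matching the CHECK above.

```python
from typing import Dict, Optional, Set, Tuple


def build_closure(parent_of: Dict[int, Optional[int]]) -> Set[Tuple[int, int]]:
    """All (ancestor, descendant) pairs. Assumes parent_of is currently acyclic."""
    pairs = set()
    for node in parent_of:
        anc = parent_of[node]
        while anc is not None:
            pairs.add((anc, node))
            anc = parent_of.get(anc)
    return pairs


def can_reparent(node: int, new_parent: int, closure: Set[Tuple[int, int]]) -> bool:
    """A move is safe unless the new parent is the node itself or one of its descendants."""
    return node != new_parent and (node, new_parent) not in closure
```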


My bad. I was speed reading, and didn't read the "Closure Table" part.


Hacking code in the repl without testing the new behavior. We all did that. Don't lie. Once I wanted to quick fix a "gmail.ca" to "gmail.com", which I did.. but to all the users instead of just the one mistaken. Fortunately I realized my mistake really fast ;-)


The pink sombrero could have saved HN: http://www.bnj.com/cowboy-coding-pink-sombrero/


Appreciating the details.

"Hacker News was down all last night."

With the internet there is no "last night" ;-) Europe - and more so Asia I assume - had to live for many working hours without HN.


Yeah, I was more productive yesterday. But I did read google cached versions of HN.


PG, quick question: Did this impact the server hosting the YC Summer 2013 applications?

When I tried to edit mine, it simply said "Thanks, scotthtaylor"


Working again now.


>>I caused an item to be its own grandchild.

Please forgive me. I know you folks tend to hate jokes on here. Don't waste your time if you're immune to corny humor. "I'm My Own Grandpa- Ray Stevens" ( with family tree diagram) http://www.youtube.com/watch?v=eYlJH81dSiw


Great story! Yup, this kind of thing happens. For some reason it reminded me of something that happened to me as a newbie engineer. It was really funny a week later.

I was troubleshooting an intermittent problem in a piece of equipment. It had several boards full of mostly LS TTL logic chips (yes, them chips). It was the kind of problem that only happened once every other day or two. Nobody knew. So, I had all kinds of instruments attached to this thing and was watching it like a hawk waiting for a failure. It had probes attached to every point in the circuit where I suspected I could see something and learn about the source of the problem. I also tested for thermal issues with heat guns and freeze sprays, familiar troubleshooting techniques to anyone who's done this kind of thing.

Anyhow, every so often the thing would go nuts. The three scopes I had connected to it showed things I simply didn't understand. I'd analyze but couldn't make any sense out of it. Still, again, every so many days it would happen again. Changed power supplies and the usual suspects. No difference.

Well, finally, two weeks later, the other engineers in the office took pity on me and told me what was going on: They had connected a VARIAC to the power strip I was using to power the UUT (unit under test). The scopes and other test instruments remained on clean power. Every so often they'd reach into this drawer where the VARIAC was hidden and lower my power strip's voltage just enough for the power supplies to fall out of regulation and everything start beeping and sputtering. Those friggin SOB's. They had me going for days! I was pissed beyond recognition. Of course, after a while I was laughing my ass off alongside them. Good joke. Cruel, but good.

My revenge: A CO2 fire extinguisher rigged to go off into his crotch when my buddy sat down to work.

Fun place to work. We did this kind of stuff all the time. Today I'd be afraid of getting sued. People have really thin skins these days.


n.b. that this is why time travel is a terrible idea.


The grandfather paradox is the best solution to the halting problem.


Because the computers that run the Matrix will get overloaded?


Funny to see that this happens to everyone. A week ago, while testing some stuff to locate a low-importance bug, I erased the whole user database. Fortunately we have a good restore so the problem was solved in a few minutes, but still, cold sweat here ...


Isn't it curious that the comet incident over Russia happened so close to the pass of DA 14. In the intro to the book:

http://ruby.bastardsbook.com/about/#why

is the note about surgical instruments left inside. It seems just like a coincidence that this happened so close to the switch to the new server, but I wonder if it's something deeper in the subconscious mind; the change to the new server is quite a big change (I know I feel that way when I have purchased a new computer (it feels different - even if it's running the same linux as before)) and could have upset the normal checks one has in place when tweaking things.


Great debugging story!

I guess the lesson is to have code that alerts you about comment loops without going into an infinite loop.

Also another lesson would be to figure out a way to have better clarity into which requests are causing a timeout on the server.


I'm curious: how was RTM able to notice that the problem seemed related to a specific item id? It would be great if he wrote a short blurb similar to yours. Which also makes me wonder, why does RTM not write much?


[deleted]


Forgot your medicine, today?


I absolutely love post-mortems like this. It clearly identified that there was a problem, what the author tried to do to fix it, and if it was successful. Even if it ends with the author not knowing too much about the solution that was used, it's still so interesting to see the workflows and be able to derive something from it.

It's also why I like to read pg's articles so much. They're so in-depth and detailed that you aren't left feeling something was left out for the sake of hiding it.


I fall into a similar misdirected-focus trap, but mine is simpler: I waste an embarrassing amount of time in editing the wrong damned file. After a sequence of small tweaks that yield no change in the results, I make a huge change and see nothing, and then realize that I've done it yet again. I need to write a vim macro that blanks the screen every few minutes, displaying the message "are you SURE this is the right file?"


> created a loop in the comment tree; I caused an item to be its own grandchild.

Ah, the online forum equivalent of going back in time to kill your grandfather.


It's nice to know that even the mightiest of us can still make mistakes. Thanks for being willing to admit mistakes so the rest of us can learn.


Thanks for fixing it.

Have you considered avoiding dipping into the repl to do this kind of fix? You don't owe any of us any sort of uptime guarantee, and you're a much better programmer than I, but it strikes me as odd that you would hack against the live server instead of creating some tool that would make it so you couldn't take down the whole site when making this kind of fix...


Thank goodness it's back. I lost my meaning of existence for the entire day. I don't know where my yesterday went. I'm ok now. :)


It's never what you think it is. One time, I had a memory leak in a Rails app that took me TWO WEEKS to find. In the end, it came down to me putting a line of config code in the wrong section of the config file, which for some reason created a recursive loop and caused my servers to crash about once every 30 minutes. #weak


Whoa, what an unfortunate coincidence. This whole bug would have been so much easier to find if it weren't for the new server.


The bugs that actually hit production are always like this - a confluence of three or so factors - because if it were simpler you'd have caught it earlier.

(Though I have to say, upgrading the code at the same time as you're restarting to fix a problem is really a rookie mistake. It's incredibly tempting because it saves so much time, but if you do it you will get it wrong sooner or later. One of the hardest skills in programming is acquiring that zen that you need to wait in a state of readiness for the effects of your first change to make themselves apparent, rather than changing something else)


That's funny -- I work in genealogy software, and loops ("being your own grandpa") happen all the time, due to data entry errors. To avoid infinite recursion, we always keep track of what records we've processed already, check whether "I've been there before", and bail out if the answer is affirmative.
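Roughly what that looks like, as a Python sketch. The `children` mapping and the `render_thread` name are stand-ins for illustration, not anything from real genealogy software or from HN:

```python
def render_thread(root, children):
    """Depth-first walk that returns (depth, id) pairs and never visits an id twice."""
    seen = set()
    out = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node in seen:
            continue  # been there before: a loop in the data, so bail out
        seen.add(node)
        out.append((depth, node))
        # push in reverse so children come out in their original order
        for kid in reversed(children.get(node, [])):
            stack.append((kid, depth + 1))
    return out


# item 1 is its own grandchild: 1 -> 2 -> 1
children = {1: [2, 3], 2: [1]}
print(render_thread(1, children))  # [(0, 1), (1, 2), (1, 3)]
```

The walk terminates even though the data contains a loop, and each item is emitted once.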


You are honest and I respect that. I'm sure many companies try to play off their downtime as something far more sophisticated when in fact, it was something too embarrassing to admit. I've certainly had my fair share of embarrassingly stupid mistakes that resulted in downtime.


Cheers for that pg - now I have to explain to my boss why I was actually productive yesterday.


My pet peeve: You made an arbitrary change while debugging a problem. NOW YOU HAVE N^2 PROBLEMS!


That is why some organizations don't allow ad hoc data fixes to be run in production. Best practice is to back up the database, run the fix against the backup, test the fix against the backup, and, all being well, run the fix against production.


Thank you for the honest explanation. This is not so easy especially for a famous person.


You should probably make your code robust to this sort of data corruption in the future.


"I don't know if Nick succeeded, because in the middle of this I gave up and went to bed." - Not a good example to your holding companies :) What would happen if they all went to bed when something goes wrong :) Just kidding.


Just about anyone that has programmed for any length of time has done something like this. It is one of those "fixes" that after it's actually fixed you try to never think of it again. Good to know PG is mortal. :-)


Related question: what is the timing for the 'Reply' link to show up? I might be fantasizing but sometimes it takes 5, sometimes 10 minutes to appear, leading people to reply as a sibling instead.


Deeply nested comments tend to be hot, and so the reply link takes a while to show up to try to give people some time to think about what they're going to say.


Interesting. So it slows down discussion as it progresses, until it either stops, is forgotten or becomes a series of long essays.


The deeper it is, the longer it takes to appear (I think at depth 7 or 8 it starts counting in hours, and at some point it just won't appear at all).

Just like too many nested if(x) { if (y) { if (z) ... }} constructs, too deep a discussion nesting is also unreadable.



When Hacker News was down, I finally lifted my head and realized that there is life beyond the screen on my phone.

Now that it's back. I realized that it's finally time to create an account :)


It seems that in your new server and also latest pushed code, I can't do `like` anything. Honestly it's not a new bug and I got used to that, don't think about that.


That explains something weird I saw.

If I went to Google's cached copy, I could see threads, and then click on them. But the front page was down. But I could see individual threads.

Very confusing.


Did you add code that detects very deep nesting levels (e.g. depth more than 100) and throws a meaningful exception to help developers diagnose the problem?
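Something along these lines, sketched in Python rather than Arc. The limit of 100 is the figure from the question, and `ThreadTooDeep`, `walk`, and the `children` mapping are invented names:

```python
MAX_DEPTH = 100  # arbitrary cap taken from the question above


class ThreadTooDeep(Exception):
    """Raised when a comment tree nests deeper than MAX_DEPTH."""


def walk(node, children, depth=0, path=()):
    """Yield (depth, id) pairs, failing loudly instead of recursing forever."""
    if depth > MAX_DEPTH:
        raise ThreadTooDeep(
            f"item {node} reached depth {depth} (limit {MAX_DEPTH}); "
            f"path begins {path[:5]}"
        )
    yield depth, node
    for kid in children.get(node, []):
        yield from walk(kid, children, depth + 1, path + (node,))
```

On a healthy tree this behaves like a plain traversal. On data with a loop it stops with an error that names an item id on the offending path, which is the clue that was missing from the logs.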


it's a good job it's your site, this type of thing is often what gets someone fired in a company. Modifying (meddling!) the production system directly.


Dumb companies, maybe.

The goal was reasonable. The action was reasonable. What are you firing somebody for? Making mistakes? Good luck making that a hiring criterion. "Ok, tell us about a time you made a mistake and what you learned from it. What's that? You never have made one? Great, you're hired!"

The solution from a retrospective should never be, "Let's make people more scared to do the right thing." Or "Let's fire people with bad luck." Firing people of PG's caliber isn't a solution, it's just another problem.


I disagree. Very few companies would think negatively of an engineer if they made such a mistake on a non-essential, non-revenue-generating fun/research project.

How many dollars did YC lose because of the outage? None. (Maybe they saved a few on bandwidth!)

I also predict that exactly zero startups will say, "Man... I'm not going to take seed money from those guys! They had discussion forum downtime."


They could save even more if they shut it down! That's a ridiculous thing to say. We could all save money that way.

You've obviously never worked in a for profit corporation. In such corporations there are policies and practices put in place to prevent just this kind of newbie mistake. You never modify the live database directly. Never ever. Whether it's a bottom line property or not.

I didn't say it would negatively impact YC's business. It might make them look incompetent, but these things happen, and people don't approach YC for their website savvy; they go there for the money and the connections. Most of the VC firms I've EIR'd at have much worse IT than HN. Their sites are barely usable. It seems to just go with the territory.

Let's not be so defensive, PG can do what he likes with his site, including take it down whenever he feels like saving bandwidth. But in the real world these kinds of things get real people on a fast track to their exit interview.


"You've obviously never worked in a for profit corporation"

Wow, really? I don't think that attitude is warranted at all.

At any company (for-profit or otherwise) there is a finite amount of time and money -- and surely we can agree that solid development/deployment practices carry an upfront time/money cost, can't we?

In an ideal world, all projects would have continuous build processes, automated tests, and management tools extensive enough to render live database surgery unnecessary.

Perhaps you've worked at companies so flush with cash that every single line of code, research project or otherwise, has gone through rigorous development/testing/deployment practices. If so, I'm jealous. I've always worked at companies that had to be choosey about how they spend their resources.


I've worked at a lot of for-profit businesses (not banks though; I can understand that in that kind of business the requirements are different) and made live database updates in all of them. In 99% of the cases all goes well, and in the remaining 1% you need to revert to your backup (always make backups before doing anything!) and have a couple of minutes of downtime.

It really depends on what kind of business you're in whether this is acceptable or not.


Upvoted. This is what I wanted to reply, but then thought better of it and moderated my response.


I think you people already know the answer. The amount of freedom and stake/reward system for pg is different from yours.

I can't speak for pg, but personally, I am not going to write a migration to re-parent a single thread if the site in question is my side project, doesn't bring revenue, has some intangible benefits, but not so much that warrant putting much labor into it.

Either it would be `thread.parent = new_parent_id`; or if it occurred to me that it might introduce a loop, changing `parent=` to take loops into account followed by `thread.parent = new_parent_id`. What were you expecting? A bug tracker discussion, code commit, review, change request and deployment?
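For what it's worth, a loop-aware `parent=` could be as small as the following Python sketch. It assumes parent links live in a dict mapping item id to parent id (None for a root). The `set_parent` name and that layout are illustrative guesses, not how HN stores threads:

```python
def set_parent(parents, item_id, new_parent_id):
    """Re-parent item_id, refusing any change that would make it its own ancestor."""
    seen = set()
    cursor = new_parent_id
    # walk up from the proposed parent; meeting item_id means a loop would form
    while cursor is not None and cursor not in seen:
        if cursor == item_id:
            raise ValueError(
                f"re-parenting {item_id} under {new_parent_id} would create a loop"
            )
        seen.add(cursor)
        cursor = parents.get(cursor)
    parents[item_id] = new_parent_id


parents = {1: None, 2: 1, 3: 2}
set_parent(parents, 3, 1)  # fine: 3 moves up one level
```

The `seen` set is there only so the check itself cannot spin forever if the data already contains a loop somewhere else.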


The problem is thinking of it as the cost of implementing the feature vs. doing manual surgery on the production database, without realizing that if you choose the latter, you're also choosing the risk that you'll spend hours debugging the system when the surgery goes wrong. It's a tradeoff, to be sure, but it's not clear that the latter is cheaper on expectation.


no, you just leave it alone, it would have sorted itself out if nothing had been done at all.

failing that you put a cap in the code that generates the page. Simple stuff.


Don't worry I just figured out the totally bone-headed programming mistake I made at noon today. Time is a good mediator between skill and stress.


Much appreciated, pg. I knew that the "10 minutes of downtime" would not occur (fair enough, this was not related to the server upgrade).


Ok, I'll just say it: that's just plain dumb. It's a rare case, but a simple condition would have checked and prevented this. :)


Did you hear about the tortoise and the hare?


Everyone fat-fingers a database at some point... Then you build interfaces so that you can't make the same mistake.


Why does PG maintain the website himself? I would think he would have many better things to do with his time.


That's funny because one of the top posts in progit yesterday was about the Hare and Tortoise algorithm
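For anyone who missed it, this is the tortoise-and-hare idea (Floyd's cycle detection) applied to parent links, sketched in Python. The `parents` dict and the function name are assumptions made for the example:

```python
def has_parent_loop(parents, start):
    """Follow parent links from start; True if they loop instead of reaching a root."""
    slow = fast = start
    while True:
        # the hare takes two steps per round
        fast = parents.get(fast)
        if fast is None:
            return False  # reached a root, so no loop
        fast = parents.get(fast)
        if fast is None:
            return False
        # the tortoise takes one step
        slow = parents.get(slow)
        if slow == fast:
            return True  # the hare caught up with the tortoise: loop


print(has_parent_loop({1: None, 2: 1, 3: 2}, 3))  # False
print(has_parent_loop({1: 2, 2: 1}, 1))           # True
```

It needs no extra memory beyond the two cursors, which is the appeal over keeping a set of visited ids.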


Thanks for the explanation pg. As you said in the original thread, "you know how these things go..."


The nice thing about surgery with a computer program on a server is that death is not permanent.


It makes me feel good knowing better programmers than me go through the same issues I face. :)


I think this demonstrates how many people browse the /threads?id=pg page (myself included)


TIL there are still large web sites out there that do not have a staging environment.


I appreciate (if not love) the fact that you bugfix and server-change yourself.

True hacker spirit.


Sorry about that.

No worries.

So, are we back on the new server? Or was this too much for one transition :)


Normality has returned :-)


I was almost sure it was Anonymous! ... Are you in Anonymous, pg? :D


Same problem in Iran, I couldn't access HN all day yesterday.


I guess it's time for NewRelic to add an Arc agent.


That sort of thing is fine for a startup in its first year or two of life, but HN has been around for a while now ... surely you must have some sort of process by now?


So, uh, still fixing stuff in production?


The cobbler's children have no shoes :-)


pg,

Just wondering why HN isn't hosted in the cloud (e.g. on AWS, Rackspace etc.)? How do you back up all the data?


Because that would cost more?

I don't really know what the benefit of cloud hosting would be in this case.


I appreciate pg's frankness here.


I hope that user wasn't me. I was editing a typo in a comment right when it happened.


I think we have a new king!

Awesome explanation.


"I am my own grandpa"


Even Homer nods!


>On a comment thread, a new user had posted some replies as siblings instead of children. I posted a comment explaining how HN worked. But then I decided to just fix it for him by doing some surgery in the repl.

No good deed goes unpunished!

People sometimes reply as a sibling because they are too impatient to wait for the built-in delay on child comments.

Thanks for keeping the experiment going.


If the problem existed before the code update, why would you assume it was the code update that caused the problem?


After breaking many things myself due to similar, seemingly minuscule edits, I have implemented an ABC routine: Always Be Checking. Even if it was "just" something like moving a piece of code or something equally tiny, I always check after the fix.

So far, it has been working great.


Are you sure that comment's name wasn't Phillip J Fry?


Apologies for the trivial comment



