
Post mortem of a failed HackerNews launch - gingerjoos
http://www.gigpeppers.com/post-mortem-of-a-failed-hackernews-launch/
======
zorlem
A few things have caught my attention in your post.

Your biggest problem was that the configuration of your services was not
sized/tuned properly for the hardware resources you have. As a result, your
servers became unresponsive and, instead of fixing the problem, you had to
wait 30+ minutes until the servers recovered.

In your case you should have limited Solr's JVM memory size to the amount of
RAM that your server can actually allocate to it (check your heap settings and
possibly the PermGen space allocation).
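As a sketch, capping the heap when launching Solr might look like this (the flag values and the `start.jar` path are illustrative assumptions, not recommendations; tune them to your box):

```shell
# Hypothetical JVM sizing for Solr on a small VPS.
# -Xms/-Xmx pin the heap at 512 MB so it can never grow past that;
# -XX:MaxPermSize caps PermGen on pre-Java-8 JVMs.
java -Xms512m -Xmx512m -XX:MaxPermSize=128m -jar start.jar
```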

If all services are sized properly, under no circumstances should your server
become completely unresponsive; only the overloaded services would be
affected. This would allow you or your system administrator to log in and fix
the root cause, instead of having to wait 30+ minutes for the server to
recover or be rebooted. In the end it will allow you to react to and interact
with the systems.

The basic principle is that your production servers should never swap (that's
why setting the vm.swappiness=0 sysctl is so important). The moment your
services start swapping, your performance will suffer so much that your server
will not be able to handle any of the requests, and they will keep piling up
until a total meltdown.
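Applying that setting is a one-liner, plus one line to persist it (assumes root and a stock /etc/sysctl.conf):

```shell
# Tell the kernel to avoid swapping except as a last resort.
sysctl -w vm.swappiness=0                        # apply now (needs root)
echo "vm.swappiness = 0" >> /etc/sysctl.conf     # persist across reboots
```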

In your case the OOM killer terminating the java process actually saved you by
allowing you to log in to the server. I wouldn't consider setting the OOM
reaction to "panic" a good approach - if there is a similar problem and you
reboot the server, you will have no idea what caused the memory usage to grow
in the first place.

~~~
foxylad
Appengine.

You're a development shop, not scalable system builders. Deciding to build
your own systems has already potentially cost you the success of this product
- I doubt you'll get a second chance on HN now. If you were on appengine,
you'd be popping champagne corks instead of blood vessels, and capitalising on
the momentum instead of writing a sad post-mortem.

I'd recommend you put away all the Solr, Apache, Nginx and Varnish manuals you
were planning to study for the next month, and check out appengine. Get
Google's finest to run your platform for you, and concentrate on what you do
best.

~~~
jedc
I wish I could vote this comment up 10 times over.

I know that I know little to nothing about sysadmin work, so when I built a
recent app I used AppEngine for this very reason. And when it got onto the HN
front page it scaled ridiculously easily without any configuration changes.
(No extra dynos, no changes at all.)

And when I've occasionally screwed up and done stupid stuff, it still doesn't
go down. (To be honest, I first saw the problem when I noticed my weekly bill
was ~$5 instead of the baseline $2.10. It helped that being a paid app pushed
the limits up a lot higher.)

------
ck2
1\. Reduce keepalive; even with nginx, 60 is too much (unless it's an
"expensive" SSL connection).

2\. Set _vm.swappiness = 0_ to make sure crippling hard drive swap doesn't
start until it absolutely has to.

3\. Use _iptables xt_connlimit_ to make sure people aren't abusing
connections, even by accident - no client should have more than 20 connections
to port 80, maybe even as low as 5 if your server is under a "friendly" DDoS.
If you are reverse proxying to apache, connlimit is a MUST.
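A connlimit rule along those lines might be sketched like this (the cap of 20 matches the figure above; assumes root and the xt_connlimit module):

```shell
# Reject any new connection from an IP that already holds 20
# connections to port 80, answering with a TCP reset.
iptables -A INPUT -p tcp --syn --dport 80 \
         -m connlimit --connlimit-above 20 \
         -j REJECT --reject-with tcp-reset
```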

~~~
jacquesm
Don't bother reducing keepalive, just disable it altogether. Unless you have a
very specific use case it is more trouble than it is worth.

~~~
zorlem
_> Don't bother reducing keepalive, just disable it altogether. Unless you
have a very specific use case it is more trouble than it is worth._

Bad idea. That way you're actively increasing the latency of your site: for
each asset that has to be fetched, you're forcing the client to open a new
connection, which can add more than 150 ms of delay per item (thanks to the
three-way TCP handshake).

What I would suggest is setting the KeepAlive timeout to a value that covers
an individual page load. That way all the page elements have a chance to reuse
the connection that has already been opened.
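For nginx, that tuning could be sketched as a small config fragment (the path and the values are illustrative assumptions; most distros include conf.d/*.conf inside the http block):

```shell
# Drop a keepalive tuning fragment into nginx's conf.d.
cat > /etc/nginx/conf.d/keepalive.conf <<'EOF'
keepalive_timeout  10s;    # long enough to cover one page load
keepalive_requests 100;    # then recycle the connection
EOF
```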

~~~
thaumaturgy
Actually, I've tested this. You get 10ms of extra delay per request, not 150.

There might be some magic value at which KeepAlive will be helpful during non-
peak periods without crippling the server during peak periods, but for a well-
engineered site, the extra 10ms delay per request shouldn't be a big enough
deal to warrant risking a full-on site outage later on.

Also, this has already been discussed _to death_ on HN:
<http://www.hnsearch.com/search#request/all&q=keepalive>

~~~
prodigal_erik
Light travels less than two thousand miles in 10 ms, and TCP requires three
one-way trips before starting the first request on a new connection. Anybody
more than 620 miles away (about half a time zone) is guaranteed to have a
higher ping time than that.

~~~
thaumaturgy
You're absolutely right, I didn't think about that. However, I did perform the
test(s) from Sacramento, CA to a server in Newark, New Jersey -- a distance of
2,810 miles according to Google.

I was fairly careful with the test(s), and the 10ms difference seemed to be
consistent. So that's odd. I need to investigate that further.

------
buro9
If anyone owns a blog or site that they suspect _may_ appear on HackerNews
(especially if you're posting it), then please take the small amount of time
to put an instance of Varnish in front of the site.

Then, ensure that Varnish is actually caching every element of the page, and
that you are seeing the cache being hit consistently.

You should expect over 10,000 unique visitors within 24 hours, with most
coming in the 30 minutes to 2 hours after you've hit the front page on HN.

You need not do your whole site... but definitely ensure that the key landing
page can take the strain.

Unless you've put something like Varnish in front of your web servers, there's
a good chance your web server is going down, especially if your pages are
quite dynamic and require any processing.
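A minimal VCL sketch for that (Varnish 3 syntax; the backend port and the cookie-stripping on "/" are assumptions you would adapt to your own site):

```shell
# Write a bare-bones Varnish config that caches the landing page.
cat > /etc/varnish/default.vcl <<'EOF'
backend default { .host = "127.0.0.1"; .port = "8080"; }

sub vcl_recv {
    if (req.url == "/") {
        unset req.http.Cookie;   # cookies would force a cache miss
    }
}
EOF
```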

~~~
ChrisNorstrom
Oh, well then, let me share this.

A few weeks ago I got on the front page and within a 24 hour period was hit
with 29,000 unique visitors with 38,000 page views. The page itself is image
heavy with 1.3 MB on first load. I'm running Wordpress with the Quick Cache
plugin by PriMoThemes. I'm hosted on a shared 1and1 server.

I've been hit before and went down, that's when I installed the Quick Cache
plugin. Also 1and1 moved me to another server at some point but I'm not sure
if it was before or after. Either the cache plugin is really good or I'm on a
rockin server all by my lonesome. Or both.

If you're self hosting a wordpress site grab the Hyper Cache plugin or the
very very simple Quick Cache plugin by PriMoThemes.

<http://www.tutorial9.net/tutorials/web-tutorials/wordpress-caching-whats-the-best-caching-plugin/>

~~~
px1999
Thank you. This is pretty much exactly what I read the post for and was a
little disappointed not to find. People talk about getting frontpaged on HN,
Slashdot or Reddit, and how you need to be sure you can handle the load, but
never give any useful figures on what that load is.

Knowing that I can serve 10 requests per second and likely withstand a
frontpage on HN is more useful to me than knowing that I need a way out if and
when my web-servers are being totally overwhelmed.

~~~
ISL
On an unexpected trip onto the front page, here's what I saw (hits at ~1 Hz on
a Friday evening in the top slot):

<http://measuredmass.wordpress.com/2012/10/12/hacker-news/>

Graphical post-mortem here:

<http://measuredmass.wordpress.com/2012/10/20/more-hn-numbers/>

------
driverdan
I'd argue the opposite of your headline, that this was a very successful
launch. Since HN isn't your target audience having your site fail from the
traffic was far better than having it fail from a launch in your market. You
shook out some important bugs before you lost real users. Plus you got to do
this followup which will bring even more traffic.

------
bdcravens
First off, best of luck with your project. Secondly, kudos on writing the
post-mortem, as I know it takes some guts to own a "failure".

I think, however, that the need to write something like this speaks to an
incorrect assumption: that you need a "launch". Of course, TC and HN can give
you a nice bump in traffic and even signups. In the long run, however, this
really doesn't accomplish much for you. It gives you the kind of traffic that
will likely leave and move on to the next article, skewing your metrics.
There are certainly qualified prospects in there, but they're hard to pick out
from all the noise.

Again, the concept of a "launch" speaks to poor business models. It really
benefits businesses where the word "traction" is more important than
"revenue". Build a business that provides a service that others will pay for
and grow as fast as the business can bear, bringing in those visitors that are
truly valuable to you.

~~~
Cherian
When we did the beta, I posted the launch details on HN and you'd be
surprised by the amount of constructive feedback and users that we got.
Cucumbertown now has users from devs to CEOs who came in through HN and are
now engaged users.

Cucumbertown has some notions like 'forking recipes' – called “Write a
variation” – which enables you to take a recipe, fork it and make changes.
Additionally, Cucumbertown has a shorthand notation for writing recipes (think
stenography for recipes) for advanced users. Things like these appeal to the
HN crowd a lot.

Also, don’t you think quite a few hackers like me are also cooks?

~~~
pinko
Wow, forking and shorthand are great features! I had no idea from your
homepage. Maybe I'm not your target audience but you should make that clearer
the moment someone lands on your homepage. ("Why we're different and maybe
better than AllRecipes or XYZ: ...")

~~~
Cherian
Thanks for the feedback.

Cucumbertown is very UX-focused, and from our research we concluded that our
“aha” moment is getting you to write a recipe in the fastest possible way and
giving you that bout of joy. Now, being hackers, we’d want to see forking &
stenography up front. But that’s been a struggle: balancing simplicity and the
“aha” moment against showcasing differentiation from others. Our primary
audience comes to Cucumbertown because they are frustrated writing recipes in
Dropbox, Google Docs, WordPress & Tumblr blogs as “blobs of text”.

------
mekoka
Thanks for this post, there were some nice tips in there. Although, I do have
some nitpicking about your writing style. Maybe it's just me, but I found that
your use of "+ve" instead of just saying "positive" and of "&" instead of
"and" did not have the intended effect of speeding up reading, quite the
reverse actually.

~~~
politician
Seconded. Initially, my brain told me that +ve was the name of the site, so I
was confused when I looked for a product page link and only saw "Cucumbertown".

Granted, that's mostly laziness -- apparently I've got a rule that matches
"strange words near the top of the post" to "probably the name of the
product".

~~~
chacham15
I don't know about anyone else, but when I saw "+ve", I just thought to
myself, "what is that?" for about half a second before giving up and moving on.

~~~
rolleiflex
It doesn't even parse for me. If "+ve" means "Positive", what does "+" mean?
Positi?

Regardless of what you do, a little bit of respect for English is always a
good thing to have.

------
lmm
Running with swap enabled is a terrible idea. The authors mention how it was
only once solr crashed that they were able to actually log in and start fixing
problems; having swap means that rather than the OOM killer terminating
processes, instead your whole system just grinds to a halt.

(it's strange that they recommend enabling swap when they also recommend
enabling reboot-on-oom, which is pretty much the complete opposite philosophy)

~~~
richardwhiuk
In my experience, the linux kernel handles no swap at all very badly, so you
need a small amount.

Increasing the swap, which is the suggested solution, is, however, a terrible
idea. As soon as you hit high memory usage, your IO load will go through the
roof, and everything will grind to a halt.

The solution here is separation of services - i.e. put Solr on a different
box, so that if it spirals it doesn't take out other services.

The OOM killer is your friend for recovering from horrible conditions, but as
soon as you hit it or swap, something's gone wrong.

~~~
Cherian
You are right. The best solution is separation of services. But for a startup
that runs 7-2 services like this – it’s a close call. You’ll often have to run
2-3 services together; otherwise $100 * 7 machines is too much burn.

~~~
zorlem
It's not a problem running several services on the same box, as long as each
of them is sized appropriately. What I suggest is to at least roughly
calculate how much, e.g., RAM each service could use at peak time, and limit
that usage so that the sum of memory used by all services at peak is less than
the amount of RAM you've got on your server.
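As a back-of-the-envelope sketch of that budget (all numbers are hypothetical; substitute your own peak measurements):

```shell
# Rough peak-memory budget for co-located services, in MB.
total_ram=2048
solr=512; postgres=384; redis=128; memcache=128; nginx=64; app=512
sum=$((solr + postgres + redis + memcache + nginx + app))
headroom=$((total_ram - sum))
echo "peak usage: ${sum} MB, headroom: ${headroom} MB"
```

If `headroom` ever goes negative at peak, the box will swap, which is exactly the meltdown scenario described above.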

------
boundlessdreamz
1\. So nginx didn't cache because of the cookie?

2\. Isn't swapping bad? I don't think I've ever had a situation in which more
than, say, 100MB of swap was helpful. Once the machine starts swapping, a
bigger swap just prolongs the agony.

3\. If you couldn't ssh in, why didn't you just reboot the machine?

Edit:

1\. What did you use for the graphs?

2\. What is the stack?

~~~
Cherian
We use DataDog - <http://www.datadoghq.com/> . It’s essentially a statsd +
graphite stack, but with many more capabilities.

Stack is Python, Django, PostgreSQL, Redis, Memcache etc.

------
TeeWEE
Stress test and load test before launch!

It doesn't take more than an hour, and you quickly learn what your upper
limits are, and where the bottlenecks are.

I prefer Gatling over JMeter: <https://github.com/excilys/gatling>
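Even without a full Gatling scenario, a crude upper-bound check takes a minute with ApacheBench (the URL is a placeholder for your own staging page):

```shell
# 1,000 requests at 50 concurrent against the landing page;
# the reported requests/second figure gives a rough ceiling.
ab -n 1000 -c 50 http://staging.example.com/
```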

------
nasalgoat
I find it very difficult to believe that this person worked on any sort of
performance team, given that what they discovered is pretty much "Handling
Load 101".

Running everything on one box? Using swap? No caching? It's like a laundry
list of junior admin mistakes.

~~~
druiid
Generally that's kind of how these smaller 'startups' work... and then I get a
call and charge my standard rate per hour ;)

------
jsaxton86
This post mortem has me thinking about the best way to handle the situation in
which you can't SSH into your server. The OP decided to trigger a kernel
panic/restart on OOM errors, but I have a couple of concerns about this
approach:

* If memory serves correctly, if your system runs out of memory, shouldn't the scheduler kill processes that are using too much memory? If this is the case, the system should recover from the OOM error and no restart should be needed.

* OOM errors aren't the only way to get a system into a state where you cannot SSH into a system. It would be great to have a more general solution.

* Even if you do restart, unless you had some kind of performance monitoring enabled, the system is no longer in the high-memory state so it will take a bit of digging to determine the root cause. If OOM errors are logged to syslog or something, I guess this isn't a big deal.

I suppose the best fail-safe solution is to ensure you always have one of the
following:

* physical access to the system

* a way to access the console indirectly (something like VSphere comes to mind)

* a hosting provider like Linode that lets you restart your system remotely, which would have been useful in this scenario

~~~
adrianpike
* In linux-land, there's an OOM killer (<http://linux-mm.org/OOM_Killer>) that would have started taking processes out. You have to exhaust swap for it to really take effect, and once you hit swap, your entire machine suddenly becomes hugely IO bound - in shared or virtual hosting environments, this usually makes the machine totally unresponsive.

* I've never seen any sort of virtual hosting service without either a remote console or a remote reboot. Usually both.

------
debacle
I clicked on the link to Cucumbertown and was immediately greeted with a
picture of Italian seasoned chicken thighs.

I think I really like your website. I really like the simplicity of the
presentation to the user.

------
nemesisj
This is a great way to make lemonade out of the lemon of getting hosed by a
lot of traffic. Write an informative post-mortem and resubmit! I know I missed
the original submission and clicked through to the site, and there you have
it. I'd say being humble and trying again is never a bad idea.

------
antirez
Whatever the _real_ cause of your issues was, Linode's default small swap
space is a plague. A system starts to misbehave much more gently if there is
enough swap.

~~~
zorlem
For a production server I think the opposite is better _general_ advice -
reduce the available swap, because if your server gets to the point of needing
it, performance will suffer so much that the server will become completely
unresponsive. Having less swap will allow the OOM killer to kill the runaway
process and allow you to log in and fix the problem, instead of rebooting the
server or waiting in vain for it to recover by itself.

 _edit: typo_

~~~
X-Istence
Put your swap on a separate drive from the main drive that is serving your
data. Solves the problem nicely.

------
alexbrand09
I am currently building a site and this is definitely an experience that I can
learn from. I am wondering, why was the homepage not being cached?

~~~
zorlem
There was a cookie set for CSRF protection, and the headers specify that
content should not be cached if there is a cookie (or, more precisely, the
cached content includes the cookie as part of the cache key, so each request
with a different cookie gets a cache miss).

~~~
Cherian
Slightly different.

"Note that in 0.8.44 behaviour was changed to something considered more
natural. As of 0.8.44 nginx no longer caches responses with Set-Cookie header
and doesn't strip this header with cache turned on (unless you instruct it to
do so with proxy_hide_header and proxy_ignore_headers). "

<http://forum.nginx.org/read.php?2,126312,126316#msg-126316>
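So on 0.8.44+ you have to opt in explicitly if you want nginx to cache such responses anyway; a sketch (paths are illustrative, and this is only safe on pages where the cookie genuinely doesn't matter):

```shell
# Fragment telling nginx to ignore and strip Set-Cookie so the
# proxied response becomes cacheable.
cat > /etc/nginx/conf.d/cache-cookie.conf <<'EOF'
proxy_ignore_headers Set-Cookie;
proxy_hide_header    Set-Cookie;
EOF
```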

~~~
zorlem
Thanks for the clarification.

------
saltcod
[ not that you asked for it here, but I've got some frontpage UI feedback: ]

I think you should put a description up front describing what Cucumbertown
is. I think the main image should be a slider with multiple feature images,
and the Latest Recipes should be the first section after that. Just my 2c!

Screen: <http://cl.ly/image/3R2Y131Z433L>

~~~
Cherian
This is wonderful feedback. We’ll definitely look into this.

------
runarb
I have been having some of the same issues on a site I run (
<http://www.opentestsearch.com/> ). Under heavy load Solr will grind to a halt
if you don't have enough RAM available.

Putting a dedicated Varnish server in front of the search servers helped a
lot. Using a CDN may also be a viable option, but I haven't tried it myself.

~~~
druiid
Well, and this is why I recommend running Solr on a standalone instance. It
(and Java/Jetty/Tomcat/etc.) is very memory-hungry in general, so it is worth
your while and money to spend a bit more and spin up a separate instance of
whatever type of service you are using to run Solr. It'll also run faster.

One last thing you can do, if none of that is possible, is use a better VM
like JRockit
(<http://www.oracle.com/technetwork/middleware/jrockit/overview/index.html>).
JRockit with the right GC is, in my experience, much better at running in
lower-memory situations.

------
pothibo
That's why I like to use Heroku/EC2 for launching a new web service. If shit
hits the fan, you can jack up the processing power/database/RAM/whatever to
scale to your demand. Once you have a good idea of the traffic it generates,
you can move it to a cheaper service.

Obviously, it's easy to say that when you're on the bench. Congratulations on
the launch by the way.

~~~
Cherian
Cucumbertown co-founder here. Actually I dislike this idea though we should
have been better prepared.

At my previous firm we had a culture of spinning up new instances whenever
traffic peaked. And tools like RightScale & Chef make it ridiculously simple.
So our style was to do that rather than optimize strained code paths, because
it is so, so convenient.

And before you know it, this notion of _hardware is cheap_ becomes a culture.
Soon enough, if you grow, you’ll be serving 100K users with 250 machines.

~~~
pothibo
I understand what you are saying.

I do agree that Chef (and RightScale? Never used it) makes it easy to spawn
new instances and, through load balancing, average out your load.

I was talking in terms of the tradeoff in the first few weeks of a new service
with an MVP. Obviously, you re-assess your needs before you get to 100k users,
and probably use something other than EC2.

In any case, both ideas are equally good. I don't claim better knowledge in
any way.

------
lnanek2
Wow, nice post with real data, graphs, and helpful tips. As the Germans would
say, I have nothing to complain about.

------
mp3tricord
Once memory goes to swap you've already lost. Personally I rarely configure
swap on servers, save for the DB. I would reconfigure your services to not
grow past physical free memory. After that you are going to have to scale
servers horizontally.

------
elchief
lamesauce.

1\. HN should let you pay them $10 and let them hammer your server(s) before
your story goes live. good for you. good for them.

2\. there's a deal at lowendbox right now for a 2GB VPS for $30 a YEAR. you
could have a healthy server farm for pretty cheap.

------
James_Henry2
This is really interesting, thanks for sharing. I love this kind of
transparency.

------
chacham15
Is there a way to run simulated traffic against your server to see how it
reacts under heavier load, and to determine how many people it can serve?

~~~
pearkes
Blitz is also worth a mention.

<http://www.blitz.io/>

~~~
ohashi
You can also connect New Relic with it and get stack traces logging how much
time each component of your app takes, including DB queries.

------
fox91
Please fix your CSS for mobile use. It's impossible to read because, if I
zoom, the sidebar gets bigger too.

------
srameshc
well, you got it right this time :)

------
maxent
Your project, Cucumbertown, is a cooking/recipe site/platform/network. Hacker
News is not your audience/customer. Any "launch" on Hacker News is a fail,
regardless of downtime.

~~~
tzaman
Couldn't disagree more. The HN community is the most (brutally) honest, and
its feedback can prove very valuable - regardless of whether HNers are the
target market or not.

