Post mortem of a failed HackerNews launch (gigpeppers.com)
150 points by gingerjoos on Nov 29, 2012 | 101 comments

A few things have caught my attention in your post.

Your biggest problem was that your services were not sized/tuned properly for the hardware resources you have. As a result, your servers became unresponsive, and instead of fixing the problem you had to wait 30+ minutes until the servers recovered.

In your case you should have limited Solr's JVM memory size to the amount of RAM that your server can actually allocate to it (check your heap settings and possibly the PermGen space allocation).
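If Solr is started directly with java, the cap might look something like this (all values here are illustrative for a ~4 GB host, not a recommendation for any specific workload; `start.jar` is the Jetty launcher Solr shipped with at the time):

```shell
# Cap the heap well below physical RAM so the OS, nginx, and the app
# server still have room; keep a dump for post-mortem if it still OOMs.
java -Xms512m -Xmx1536m \
     -XX:MaxPermSize=256m \
     -XX:+HeapDumpOnOutOfMemoryError \
     -jar start.jar
```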

If all services are sized properly, under no circumstances should your server become completely unresponsive; only the overloaded services would be affected. This would allow you or your system administrator to log in and fix the root cause, instead of having to wait 30+ minutes for the server to recover or be rebooted. In the end it will allow you to react to and interact with the systems.

The basic principle is that your production servers should never swap (that's why setting the vm.swappiness=0 sysctl is so important). The moment your services start swapping, performance will suffer so much that your server will not be able to handle any requests, and they will keep piling up until a total meltdown.
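For reference, a sketch of how that setting is usually applied (standard Linux paths; needs root):

```shell
# Apply immediately, then persist it across reboots.
sysctl -w vm.swappiness=0
echo 'vm.swappiness = 0' >> /etc/sysctl.conf
```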

In your case the OOM killer terminating the Java process actually saved you by allowing you to log in to the server. I wouldn't consider setting the OOM reaction to "panic" a good approach: if there is a similar problem and you reboot the server, you will have no idea what caused the memory usage to grow in the first place.


You're a development shop, not scalable system builders. Deciding to build your own systems has already potentially cost you the success of this product - I doubt you'll get a second chance on HN now. If you were on appengine, you'd be popping champagne corks instead of blood vessels, and capitalising on the momentum instead of writing a sad post-mortem.

I'd recommend you put away all the Solr, Apache, Nginx and Varnish manuals you were planning to study for the next month, and check out appengine. Get Google's finest to run your platform for you, and concentrate on what you do best.

I wish I could vote this comment up 10 times over.

I know that I know little to nothing about sysadmin, so when I built a recent app I used AppEngine for this very reason. And when it got onto the HN front page it scaled ridiculously easily without any configuration changes. (No extra dynos, no changes at all.)

And when I've occasionally screwed up and done stupid stuff, it still doesn't go down. (To be honest, I first saw the problem when I noticed my weekly bill was ~$5 instead of the baseline $2.10. It helped that being a paid app pushed the limits up a lot higher.)

Any PaaS would do; Appfog, for example.

I'd say that the biggest problem is that they tried to launch their product on what appears[1] to be a 4G host, representing maybe $300-400 of hardware cost (maybe more if you buy premium hardware, which I doubt Linode does).

I mean, careful configuration and capacity planning is important. But what happened to straightforward conservative hardware purchasing where you get a much bigger system than you think you need? It's not like bigger hosts are that expensive: splurge for an EC2 2XL ($30/day I think) or three for the week you launch and have a simple plan in place for bringing up more to handle bursty load.

[1] The OOM killer picked a 2.7G Java process to kill. It usually picks the biggest thing available, so I'm guessing at 4G total.

1. Reduce keepalive; even with nginx, 60 seconds is too much (unless it's an "expensive" SSL connection).

2. Set vm.swappiness = 0 to make sure crippling hard-drive swap doesn't start until it absolutely has to.

3. Use the iptables xt_connlimit module to make sure people aren't abusing connections, even by accident. No client should have more than 20 connections to port 80, maybe even as low as 5 if your server is under a "friendly" DDoS. If you are reverse proxying to Apache, connlimit is a MUST.
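A minimal sketch of such a rule (the 20-connection ceiling is the number suggested above; the command needs root, and exact behaviour depends on your kernel's connlimit module):

```shell
# Reject new port-80 connections from any single source IP that already
# has 20 open; a TCP reset is friendlier to browsers than a silent drop.
iptables -A INPUT -p tcp --syn --dport 80 \
         -m connlimit --connlimit-above 20 \
         -j REJECT --reject-with tcp-reset
```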

> 3. Use the iptables xt_connlimit module to make sure people aren't abusing connections, even by accident. No client should have more than 20 connections to port 80, maybe even as low as 5 if your server is under a "friendly" DDoS. If you are reverse proxying to Apache, connlimit is a MUST.

One must be careful when setting connection limits like this. A lot of people still use proxy servers, and with modern browsers it's quite easy to hit 20 concurrent connections per IP address.

The connlimit only affects simultaneous connections, and they all should be handled relatively quickly.

It's important to understand that connlimit will cause people to queue, not get blocked, and if 20 people are connecting at the very same millisecond from the same IP, it cannot hurt to queue them for server stability.

I agree that having a simultaneous connection limit is a good safeguard. I was just noting that simply limiting the simultaneous connections to, e.g., 20 without considering all the side effects could lead to a world of hard-to-diagnose-and-reproduce problems and frustrated customers.

HTTP Keep-alive connections play a big role in the number of connections as well so this should be considered when choosing the number.

Don't bother reducing keepalive, just disable it altogether. Unless you have a very specific use case it is more trouble than it is worth.

A small keepalive helps prevent browsers from trying to open too many connections and encourages them to reuse existing ones more efficiently, from what I have seen. Nginx handles connections more efficiently than Apache, so it doesn't hurt. But even Apache can benefit from a couple of seconds of keepalive to get less thrashing.

You can use the optional second part of the keepalive_timeout setting in nginx to send timeout hints to modern browsers, i.e.

   keepalive_timeout 10 10;
Some servers like LiteSpeed can easily do keepalive for static content (i.e. a series of images) and then connection close for dynamic content. This behavior can be emulated in nginx with the right configuration.
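One way that emulation might look (a sketch only: the paths are illustrative and `backend` is a hypothetical upstream):

```nginx
# Keep connections alive for static assets, close them for dynamic pages.
location /static/ {
    keepalive_timeout 10 10;   # second value is the Keep-Alive header hint
}
location / {
    keepalive_timeout 0;       # nginx sends "Connection: close"
    proxy_pass http://backend;
}
```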

> Don't bother reducing keepalive, just disable it altogether. Unless you have a very specific use case it is more trouble than it is worth.

Bad idea. This way you're actively increasing the latency of your site: for each asset that has to be fetched you're forcing the client to open a new connection, which can add more than 150 ms of delay per item (thanks to the three-way TCP handshake).

What I would suggest is setting the KeepAlive timeout to a value that could handle each individual page load. This way all the page elements will have a chance to use the connections that have already been opened.

With threaded Apache (mpm_worker), I have a huge keepalive set.

It's great for an ajax-heavy site, especially when it's all behind SSL. Using a CustomLog I log total request time, from connection to when the request has been served (a conditional log when the request is handled by the backend), and I can see it's halved since I switched to threaded Apache.

Currently I have 100 threads per Apache process, and ~20 of them handling 2000 idle connections. I'm sure this can be tweaked some more.

Apache manages SSL etc. and just proxies to my application servers.

There's also an event-based Apache module which I haven't tried.

Actually, I've tested this. You get 10ms of extra delay per request, not 150.

There might be some magic value at which KeepAlive will be helpful during non-peak periods without crippling the server during peak periods, but for a well-engineered site, the extra 10ms delay per request shouldn't be a big enough deal to warrant risking a full-on site outage later on.

Also, this has already been discussed to death on HN: http://www.hnsearch.com/search#request/all&q=keepalive&#...

Light travels less than two thousand miles in 10 ms, and TCP requires three one-way trips before starting the first request on a new connection. Anybody more than 620 miles away (about half a time zone) is guaranteed to have a higher ping time than that.

You're absolutely right, I didn't think about that. However, I did perform the test(s) from Sacramento, CA to a server in Newark, New Jersey -- a distance of 2,810 miles according to Google.

I was fairly careful with the test(s), and the 10ms difference seemed to be consistent. So that's odd. I need to investigate that further.

This is excellent advice! Thank you.

If anyone owns a blog or site that they suspect may appear on HackerNews (especially if you're posting it), then please take the small amount of time to put an instance of Varnish in front of the site.

Then, ensure that Varnish is actually caching every element of the page, and that you are seeing the cache being hit consistently.
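One quick way to check this from the outside is to request the same URL twice and watch Varnish's standard `Age` header; a value above 0 on a repeat request suggests the object came from cache (`example.com` is a placeholder for your own site):

```shell
# Two HEAD requests in a row; an increasing non-zero Age indicates a hit.
curl -sI http://example.com/ | grep -i '^Age:'
sleep 2
curl -sI http://example.com/ | grep -i '^Age:'
```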

You should expect over 10,000 unique visitors within 24 hours, with most coming in the 30 minutes to 2 hours after you've hit the front page on HN.

You need not do your whole site... but definitely ensure that the key landing page can take the strain.

Unless you've put something like Varnish in front of your web servers, there's a good chance your web server is going down, especially if your pages are quite dynamic and require any processing.

Oh, well then let me share this.

A few weeks ago I got on the front page and within a 24 hour period was hit with 29,000 unique visitors with 38,000 page views. The page itself is image heavy with 1.3 MB on first load. I'm running Wordpress with the Quick Cache plugin by PriMoThemes. I'm hosted on a shared 1and1 server.

I've been hit before and went down; that's when I installed the Quick Cache plugin. Also, 1and1 moved me to another server at some point, but I'm not sure if it was before or after. Either the cache plugin is really good, or I'm on a rockin' server all by my lonesome. Or both.

If you're self hosting a wordpress site grab the Hyper Cache plugin or the very very simple Quick Cache plugin by PriMoThemes.


Thank you. This is pretty much exactly what I read the post for and was a little disappointed not to find. People talk about getting frontpaged on HN, Slashdot or Reddit, and how you need to be sure you can handle the load, but never give any useful figures on what that load is.

Knowing that I can serve 10 requests per second and likely withstand a frontpage on HN is more useful to me than knowing that I need a way out if and when my web-servers are being totally overwhelmed.

On an unexpected trip onto the front page, here's what I saw (hits at ~1 Hz on a Friday evening in the top slot):


Graphical post-mortem here:


Agreed. I self-hosted a wordpress blog for a video-game fan site. I had a server status page that was a normal wordpress page that imported a file from the filesystem to embed in the middle (reported status of the game servers).

During some large events I was seeing 5-7 page views per second and Wordpress did just fine (I think I had 120k page views in 24 hours). But I made sure to test as much of my site as possible to make sure it performed well: using various page analyzers to make sure the proper headers were returned so objects were cached on the client, tuning Wordpress to cache pages properly, turning on PHP APC (the opcode cache), and running various stress tests on the site (ApacheBench and Load Impact).

I'd say the page analysis tools and extended apache-benchmark runs really helped me tune my OS and services properly so I could handle a huge load.

The standard WP caching plugin most people use is W3 Total Cache. It can require some effort to set up and get everything working, but once done it will make order-of-magnitude improvements to your performance.

WP Super Cache (using the htaccess option) got a client's site through 56,000 uniques / 78,000 page views on a HostGator VPS with 1.3GB RAM. The site unexpectedly got picked up by Japanese blogs as it was being taken live. Google Analytics showed up to 400 active visitors. Fun times.

AwStats screen shot: https://unavailable.s3.amazonaws.com/20121130_WP-Super-Cache...

I'll be very happy to help you and HN'ers set Varnish up on their server (not looking for compensation for this) and get you through HN traffic on your launch day.

Plug: we've built several products around Varnish, so we have a good handle on how and where Varnish can be leveraged. Here's a list of Varnish things we've built at unixy:

Varnish load balancer: http://www.unixy.net/advanced-hosting/varnish-load-balancer

Varnish for cPanel and DirectAdmin: http://www.unixy.net/varnish

Varnish w/ Nginx for cPanel: http://www.unixy.net/advanced-hosting/varnish-nginx-cpanel

Email in profile. I'll be more than happy to help out.


A how-to guide would be really really cool.

Varnish is very much like a programmable device. A VCL that works for one website can break another. So it's important to know what you're up against in order to cook up the right VCL.

You could come up with a generic VCL that works for most websites out there, but its cache effectiveness diminishes as you try to account for the most common corner cases. In fact, we did come up with such a VCL. We distribute it with the cPanel Varnish plugin.

If you ever have a question or need a hand with Varnish/VCL drop me an email. I'll be more than happy to help out.

Cucumbertown co-founder here. Nginx was serving the cache, and our sense was that it was caching. But then, the day before, we put in CSRF validation on the login form, and that bypassed the caching.

So in theory we were positioned to serve from Nginx cache.

Ah, always the kicker... small changes with a big and unseen impact.

And that's why you should have tests :)

Why does the login form need CSRF protection?

The blog mentions that they did have caching on with nginx (which is what Varnish does, isn't it?). The problems were because of nginx not caching the frontpage (configuration issue) and because there was an unexpected hit on solr.

The unexpected hit was just users using the full text search, right? That probably should have been one of the first things tested, as I assume that would be a speed bottleneck at all stages of development... or at least it has been in my experience.

I often try to think of ways to avoid full text search. This case is a little more difficult, but why not create a list of common recipe names (i.e. "grilled cheese" would be a facet for all grilled cheese sandwiches) and store it as static JSON? It would take some taxonomy work on the backend, but the list as JSON could easily be less than a MB, and then you wouldn't have to worry about full search as much. Full search could still be an option, just not a top-of-the-front-page option.

Agree, this was the Achilles heel. It was a total blunder not testing full text search.

In our defense this page was supposed to be HTTP cached and that didn’t happen which led to this domino effect.

I'd argue the opposite of your headline, that this was a very successful launch. Since HN isn't your target audience having your site fail from the traffic was far better than having it fail from a launch in your market. You shook out some important bugs before you lost real users. Plus you got to do this followup which will bring even more traffic.

First off, best of luck with your project. Secondly, kudos on writing the post-mortem, as I know it takes some guts to own a "failure".

I think, however, the need to write something like this speaks to an incorrect assumption: that you need a "launch". Of course, TC and HN can give you a nice bump in traffic and even signups. However, in the long run this really doesn't accomplish much for you. It gives you the kind of traffic that will likely leave and move on to the next article, skewing your metrics. There are certainly qualified prospects in there, but it's hard to decipher with all the noise.

Again, the concept of a "launch" speaks to poor business models. It really benefits businesses where the word "traction" is more important than "revenue". Build a business that provides a service that others will pay for and grow as fast as the business can bear, bringing in those visitors that are truly valuable to you.

When we did the beta, I posted the launch details on HN and you’ll be surprised by the amount of constructive feedback and users that we got. Cucumbertown now has users from devs to CEO’s who came in through HN and are now engaged users.

Cucumbertown has some notions like 'forking recipes', called "Write a variation", which enables you to take a recipe, fork it, and make changes. Additionally, Cucumbertown has a shorthand notation for writing recipes (think stenography for recipes) for advanced users. Things like these appeal to the HN crowd a lot.

Also, don't you think quite a few hackers like me are also cooks?

Wow the site's really progressed a lot since the last time I saw it. I'm not surprised you're one of the most passionate people I have ever had a chance to talk to.

This is probably my favorite feature: http://www.cucumbertown.com/tribe/ (Jane Jojo's really flying ahead)

Wow, forking and shorthand are great features! I had no idea from your homepage. Maybe I'm not your target audience but you should make that clearer the moment someone lands on your homepage. ("Why we're different and maybe better than AllRecipes or XYZ: ...")

Thanks for the feedback.

Cucumbertown is very UX focused, and from our research we came to the conclusion that our "aha" moment is getting you to the fastest possible way to write a recipe and giving you that bout of joy. Now, being hackers, we'd want to see forking & stenography front and center. But it's been a struggle to balance simplicity on the way to the "aha" moment against showcasing our differentiation from others. Our primary audience comes to Cucumbertown because they are frustrated writing recipes in Dropbox, Google Docs, Wordpress & Tumblr blogs as "blobs of text".

I know as well as anyone the relative futility of relying on HN, Reddit, or TC coverage for building a successful tech product. Feedback and traffic from social news is merely a blip that says next to nothing one way or another about your long-term prospects.

However, if your site goes down for any reason a postmortem of this sort is definitely warranted. The word "launch" is not signifying much more than a point in time in this case, and I think you're jumping to a lot of conclusions about what hopes they were pinning on this event.

Right on. I second these sentiments. First, keep up the good work and best of luck moving forward. Very good that you're also reflecting on your successes and failures - always be learning.

The most challenging piece of a new business is, well, new business. And it's about growing your value proposition organically, one customer at a time, and refining the business. Analyzing bump in media attention won't really help you on that piece of the search.

Once you've nailed down the search, and you're simply focused on getting more publicity as you scale, then perhaps that sort of analysis will be of more use. But, I doubt it.

I think the author wasn't just looking at a launch per se. He was looking for feedback, and HN is arguably the best place for feedback w.r.t. startups.

> HN community’s remarks and constructive criticism are pearls of wisdom

Perhaps, but he did call this post ".. a failed HN launch"


Those "poor business models" are making plenty of companies lots of money.

Thanks for this post, there were some nice tips in there. Although, I do have some nitpicking about your writing style. Maybe it's just me, but I found that your use of "+ve" instead of just saying "positive" and of "&" instead of "and" did not have the intended effect of speeding up reading, quite the reverse actually.

Seconded. Initially, my brain told me that +ve was the name of the site, so I was confused when I looked for a product page link and only saw "Cucumbertown".

Granted, that's mostly laziness -- apparently I've got a rule that matches "strange words near the top of the post" to "probably the name of the product".

I don't know about anyone else, but when I saw "+ve", I just thought to myself, "what is that?" for about half a second before giving up and moving on.

It doesn't even parse for me. If "+ve" means "Positive", what does "+" mean? Positi?

Regardless of what you do, a little bit of respect for English is always a good thing to have.

Thanks for the feedback. Corrected.

Call this a hacker's laziness + Yahoo-chat-room-era slang.

Agree with the former, but it bears saying: you're about two millennia too late to be complaining about the use of the ampersand. :)

I'm not complaining about the use of the ampersand, but unlike what many people seem to believe, it's not semantically equivalent to "and". Beyond just joining two items in a phrase, the ampersand marks an association between them and emphasizes them as a single definite idea, a "thing".

Ampersands are often used to mark brands, names and cultural items made up of multiple components: Johnson & Johnson, Dungeons & Dragons, bread & butter, fish & chips, Gold, Smith & Associates.

If I say "I had some fish & coleslaw" that would make a few people wonder if this is some popular recipe they should google.

Running with swap enabled is a terrible idea. The authors mention how it was only once solr crashed that they were able to actually log in and start fixing problems; having swap means that rather than the OOM killer terminating processes, instead your whole system just grinds to a halt.

(it's strange that they recommend enabling swap when they also recommend enabling reboot-on-oom, which is pretty much the complete opposite philosophy)

I think OP's post treats swap along the conventional lines, i.e. swap is good. I think that's true for applications running on a client machine, where you don't want an app, say Eclipse, to crash for lack of memory. And for some years that conventional wisdom stayed on the server side as well.

But the modern wisdom is that, in general(+), it may be good to have no swap at all on your server, and instead to address things by running parallel instances and load balancing.

Swap space may also run out eventually if some service is leaking memory, and until it does, it will make the system slow for everybody. It's better instead to let the culprit processes simply die, and make things easier for everybody else.

On my server my jettys keep dying when they run out of memory. Thankfully there are a lot of other instances there to process requests.

It's a trade-off you make: dropping some requests that are currently hitting the errant service (a jetty instance in my case) vs. slowing things down for everybody, to the extent that even the developers can't help until something finally runs out of swap space too and dies (like the Solr case you and the OP mention).

+ I say in general because there definitely could be reasons when you need swap.

Edit: Added explanation for (+)

In my experience, the linux kernel handles no swap at all very badly, so you need a small amount.

Increasing the swap, which is the suggested solution, is, however, a terrible idea. As soon as you hit high memory usage, your IO load will go through the roof and everything will grind to a halt.

The solution here is separation of services - i.e. put Solr on a different box, so that if it spirals it doesn't take out other services.

The OOM killer is your friend for recovering from horrible conditions, but as soon as you hit it or swap, something's gone wrong.

> In my experience, the linux kernel handles no swap at all very badly, so you need a small amount.


I'm pretty sure we disable swap at Google. Maybe swap was necessary back in the days when memory was really tight, but it seems like a terrible idea now. Especially since the scheduling is completely oblivious to swap AFAIK, which means that a heavily swapped system will spend most of its timeslices just swapping program code back into memory. It's the worst kind of thrashing.

You are right. The best solution is separation of services. But for a startup that runs 7 services like this, it's a close call. You'll often have to run 2-3 services together, else $100 * 7 machines is too much burn.

It's not a problem running several services on the same box as long as each of them is sized appropriately. What I suggest is to at least roughly calculate how much RAM, for example, each service could use at peak time. This usage should be limited so that the sum of memory used by all services at peak time is less than the amount of RAM you've got on your server.
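That rule of thumb can be made concrete with trivial arithmetic. A toy budget for a hypothetical 4 GB host (every figure below is an assumption for illustration, not a measured value):

```shell
# Sum each service's expected peak RSS plus OS headroom; the total must
# stay under physical RAM or the box will swap under load.
total_mb=4096
solr_mb=1536; postgres_mb=768; app_mb=768; nginx_mb=128; os_mb=512
used_mb=$((solr_mb + postgres_mb + app_mb + nginx_mb + os_mb))
echo "budgeted: ${used_mb} MB of ${total_mb} MB, headroom: $((total_mb - used_mb)) MB"
```

If the headroom comes out negative, something has to shrink or move to another box.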

1. So nginx didn't cache because of cookie?

2. Isn't swapping bad? I don't think I've ever had a situation in which more than, say, 100MB of swap was helpful. Once the machine starts swapping, a bigger swap just prolongs the agony.

3. If you couldn't ssh, why didn't you just reboot the machine?


1. What did you use for the graphs?

2. What is the stack?

We use DataDog - http://www.datadoghq.com/ . It's a statsd + graphite incarnation, but with many more capabilities.

Stack is Python, Django, PostgreSQL, Redis, Memcache, etc.

Stress test, load test before launch!

It doesn't take more than an hour, and you quickly learn what your upper limits are and where the bottlenecks are.

I use gatling in favor of JMeter: https://github.com/excilys/gatling
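Even without a full Gatling scenario, a one-line ApacheBench run gives a rough ceiling (the URL is a placeholder for your own site; the numbers are a modest starting point):

```shell
# 10,000 requests at 100 concurrent; watch "Requests per second"
# and make sure "Failed requests" stays at 0.
ab -n 10000 -c 100 http://example.com/
```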

I find it very difficult to believe that this person worked on any sort of performance team, given that what they discovered is pretty much "Handling Load 101".

Running everything on one box? Using swap? No caching? It's like a laundry list of junior admin mistakes.

Generally that's kind of how these smaller 'startups' work... and then I get a call and charge my standard rate per hour ;)

This post mortem has me thinking about the best way to handle the situation in which you can't SSH into your server. The OP decided to trigger a kernel panic/restart on OOM errors, but I have a couple of concerns about this approach:

* If memory serves correctly, if your system runs out of memory, shouldn't the scheduler kill processes that are using too much memory? If this is the case, the system should recover from the OOM error and no restart should be needed.

* OOM errors aren't the only way to get a system into a state where you cannot SSH into it. It would be great to have a more general solution.

* Even if you do restart, unless you had some kind of performance monitoring enabled, the system is no longer in the high-memory state so it will take a bit of digging to determine the root cause. If OOM errors are logged to syslog or something, I guess this isn't a big deal.

I suppose the best fail-safe solution is to ensure you always have one of the following:

* physical access to the system

* a way to access the console indirectly (something like VSphere comes to mind)

* Services like linode allow you to restart your system remotely, which would have been useful in this scenario

* In linux-land, there's an OOM killer (http://linux-mm.org/OOM_Killer) that would have started taking processes out. You have to exhaust swap for it to really take effect, and once you hit swap, your entire machine suddenly becomes hugely IO bound - in shared or virtual hosting environments, this usually makes the machine totally unresponsive.

* I've never seen any sort of virtual hosting service without either a remote console or a remote reboot. Usually both.

I clicked on the link to Cucumbertown and was immediately greeted with a picture of Italian seasoned chicken thighs.

I think I really like your website. I really like the simplicity of the presentation to the user.

This is a great way to make lemonade out of the lemon of getting hosed by a lot of traffic. Write an informative post-mortem and resubmit! I know I missed the original submission and clicked through to the site, and there you have it. I'd say being humble and trying again is never a bad idea.

Whatever the real cause of your issues was, Linode's default small swap space is a plague. A system starts to misbehave much more gently if there is enough swap.

For a production server I think that the opposite is a better _general_ advice - reduce the available swap, because if your server gets to a point to need it, the performance will suffer so much that your server will become completely unresponsive. Having less swap will allow the OOM to kill the run-away process and allow you to login and fix the problem instead of rebooting the server or waiting in vain for it to recover by itself.

edit: typo

Put your swap on a separate drive from the main drive that is serving your data. Solves the problem nicely.

I am currently building a site and this is definitely an experience that I can learn from. I am wondering, why was the homepage not being cached?

There was a cookie set for CSRF protection, and the headers specify that content should not be cached if there is a cookie (or more precisely, the cached content includes the cookie as a cache key, so each request with a different cookie gets a cache miss).

Slightly different.

"Note that in 0.8.44 behaviour was changed to something considered more natural. As of 0.8.44 nginx no longer caches responses with Set-Cookie header and doesn't strip this header with cache turned on (unless you instruct it to do so with proxy_hide_header and proxy_ignore_headers). "
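Given that behaviour, one way to keep the front page cached despite the CSRF cookie is to instruct nginx to ignore it for that one location. A sketch (assumes a `proxy_cache_path` zone named `my_cache` and an upstream named `backend`, both hypothetical; only safe for pages whose body is identical for every user):

```nginx
location = / {
    proxy_ignore_headers Set-Cookie;  # cache even when the backend sets a cookie
    proxy_hide_header    Set-Cookie;  # don't serve one user's cookie to everyone
    proxy_cache          my_cache;
    proxy_pass           http://backend;
}
```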


Thanks for the clarification.

How would you circumvent this? I'm thinking that disabling CSRF is probably a bad idea. Maybe use AJAX to get the CSRF token after page load?

Developer here. Since we had a click-to-open form at the time, we loaded the CSRF token via AJAX. However, that does not seem to be a good idea if we need it to work asap (and without JavaScript). I would look at something like SSI to put the CSRF token into a cached page.
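A rough, untested sketch of that SSI idea in nginx (the location names, cache zone, and `/csrf-token` endpoint are all hypothetical; the backend would emit an `<!--#include virtual="/csrf-token" -->` in the page body):

```nginx
location = / {
    ssi on;                        # SSI includes are re-evaluated on each delivery
    proxy_cache my_cache;
    proxy_ignore_headers Set-Cookie;
    proxy_pass http://backend;
}
location = /csrf-token {
    proxy_pass http://backend;     # tiny per-request response carrying the token
}
```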

[ not that you asked for it here, but I've got some frontpage UI feedback: ]

I think you should put a description up front to describe what Cucumbertown is. I think that main image should be a slider with multiple feature images, and Latest Recipes should be the first section after it. Just my 2c!

Screen: http://cl.ly/image/3R2Y131Z433L

This is wonderful feedback. We'll definitely look into this.

I have been having some of the same issues on a site I run ( http://www.opentestsearch.com/ ). Under heavy load solr will grind to a halt if you don't have enough ram available.

Putting a dedicated Varnish server in front of the search servers helped a lot. Using a CDN may also be a viable option, but I haven't tried it myself.

Well, and this is why I recommend running Solr on a standalone instance. It (like java/jetty/tomcat/etc. in general) is very memory hungry, so it is worth your while and money to spend a bit more and spin up a separate instance, or whatever type of service you are using, to run Solr. It'll also run faster.

One last thing you can do, if none of that is possible, is use a better VM like JRockit (http://www.oracle.com/technetwork/middleware/jrockit/overvie...). JRockit with the right GC is, in my experience, much better at running in lower-memory situations.

That's why I like to use Heroku/EC2 for launching a new web service. If shit hits the fan, you can jack up the processing power/database/RAM/whatever to scale to your demand. Once you have a good idea of the traffic it generates, you can then move it to a cheaper service.

Obviously, it's easy to say that when you're on the bench. Congratulations on the launch by the way.

Cucumbertown co-founder here. Actually I dislike this idea though we should have been better prepared.

At my previous firm we had a culture that whenever traffic peaked we spun up new instances. And tools like RightScale & Chef make it ridiculously simple. So our style was to do that rather than optimize the strained code paths. Because it is so, so convenient.

And before you know it, this notion that hardware is cheap becomes the culture. Soon enough, if you grow, you’ll be serving 100K users with 250 machines.

I understand what you are saying.

I do agree that Chef (and RightScale? I've never used it) makes it easy to spawn new instances and, through load balancing, average out your load.

I was talking in terms of the tradeoff in the first few weeks of a new service with an MVP. Obviously, you re-assess your needs before you get to 100k users, and probably use something other than EC2.

In any case, both ideas are equally good. I don't claim better knowledge in any way.

Sounds like a good thing to me.

If it was better for your previous firm to pay more than to optimize, it's actually preferring developer time (which costs money, as you know) over server costs.

This can be cost effective up to a point, though I don't think it would get to the level of "100k users on 250 servers".

And if the other side of the coin is that you waste dev time AND your site is down for a few hours... is it really worth the "culture fear"?

I think throwing more resources at the problem is a quick and dirty solution for when things go downhill quickly (like what happened here), and having that option is incredibly nice. Still, it should take second priority to proper configuration tuning in the long term.

Also, with runaway memory there will always be a point where all the memory in the world isn't enough.

Wow, nice post with real data, graphs, and helpful tips. As the Germans would say, I have nothing to complain about.

Once memory goes to swap you've already lost. Personally I rarely configure swap on servers, save for the DB. I would reconfigure your services not to grow past free physical memory. After that you are going to have to scale servers horizontally.
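This echoes the swappiness advice in the top comment; a sketch of the usual knob on Linux, run as root — the value 0 mirrors that advice rather than anything from this comment:

```shell
# Discourage the kernel from swapping application pages.
sysctl -w vm.swappiness=0

# Persist the setting across reboots.
echo "vm.swappiness = 0" >> /etc/sysctl.conf
```

With swapping effectively off, an over-allocated service gets OOM-killed quickly instead of dragging the whole box into an unresponsive thrash, which is exactly the failure mode described in the post.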


1. HN should let you pay them $10 and hammer your server(s) before your story goes live. Good for you, good for them.

2. There's a deal at lowendbox right now for a 2GB VPS for $30 a YEAR. You could have a healthy server farm for pretty cheap.

This is really interesting, thanks for sharing. I love this kind of transparency.

Is there a way to run simulated traffic to see how your server reacts under heavier load, and determine how many people it can serve?

Yes, there are. A previous comment by TeeWEE [1] mentions two tools. There are others as well; Siege [2] and Apache Bench [3] come to mind.

[1] http://news.ycombinator.com/item?id=4847949 [2] http://www.joedog.org/siege-home/ [3] http://httpd.apache.org/docs/2.2/programs/ab.html
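If you want something scriptable without installing any of those tools, a minimal concurrent load generator is only a few lines of Python. This sketch stands up a throwaway local server so it is self-contained; the request counts, concurrency, and `hammer` helper are illustrative assumptions, not any particular tool's API:

```python
import concurrent.futures
import http.server
import threading
import time
import urllib.request


def hammer(url, total=100, concurrency=10):
    """Fire `total` GET requests at `url` with `concurrency` workers.

    Returns the per-request latencies in seconds.
    """
    def one(_):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            resp.read()
        return time.perf_counter() - start

    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(one, range(total)))


if __name__ == "__main__":
    # Throwaway local server so the example doesn't hammer a real site.
    server = http.server.ThreadingHTTPServer(
        ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_address[1]

    latencies = hammer(url, total=50, concurrency=5)
    print("requests completed: %d" % len(latencies))
    print("max latency: %.3fs" % max(latencies))
    server.shutdown()
```

It is no substitute for Siege or ab (no ramp-up, no error-rate reporting), but it is enough to watch `top` on the server while the requests land and see which process eats the RAM first.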

Blitz is also worth a mention.


You can also connect New Relic and get stack traces showing how much time each component of your app takes, including DB queries.

Blitz is good, but I've found its way of building test cases fairly limiting. I've used http://www.loadimpact.com with good success.

Addendum: Oh, Load Impact is pricey though.

Please fix your CSS for mobile use. It's impossible to read because if I zoom, the sidebar gets bigger too.

well, you got it right this time :)

Your project, Cucumbertown, is a cooking/recipe site/platform/network. Hacker News is not your audience/customer. Any "launch" on Hacker News is a fail, regardless of downtime.

Couldn't disagree more. The HN community is the most (brutally) honest, and its feedback can prove very valuable — regardless of whether HNers are the target market or not.

Just because you like hacking/coding/engineering/hn-job-title-here, you can't enjoy cooking?

I've never thought of HN as a marketing platform. It's a place for hackers (perhaps the "news" fails to encompass the full site) to discuss PG's concept of the hacking entrepreneur. Far too many Show HNs target developers.

Cooking often requires hacking. Engineers love to cook. Also, startups often look for cheap/simple food to sustain them.

Cooking/food articles consistently do well here.
