The Switch from Heroku to Hardware (justcramer.com)
100 points by zeeg on Aug 31, 2012 | 51 comments


So basically nothing has gone wrong yet, therefore it's a slam dunk case for Doing It Yourself.

I'm sorry, but no.

You don't pay an ops guy or Heroku buckets of $$ for when things are going well, just as you don't pay $$ for software that only handles the happy case.

You pay $$ for someone who has fixed shit that went horribly wrong and has the scars to prove it. "That deep purple welt on my lower ego ... is where I only had a backup script and never tested the backups. This interesting zigzag is where I learnt about things that can go wrong with heartbeat protocols ..."

Edit: though see below for a more nuanced discussion of reasons from OP.


You're saying before Heroku (or the cloud) everyone paid an operations guy to do anything?

Nope.

I don't need to suffer the same acts of terror operations people have gone through to be able to avoid, prevent, or recover from them. I'm paying myself to be the operations guy, as well as the engineer.


It's about value delivered and opportunity cost paid.

You're betting foregone engineering time against Heroku's value-as-delivered.

I don't think that the hours you will inevitably spend fixing the ops infrastructure will turn out to be very profitable.

For a large business, moving off Heroku to inhouse operations is probably justifiable, because they can capture sufficient value from a smart ops team to offset the potentially very high monthly bill Heroku would levy for their business.

But for a smaller firm, Heroku so abundantly oversupplies value compared to the bill that you will simply never be able to capture that value from internal work at anything like the same price.

It's like saying "I should stop buying food from the supermarket, it would be cheaper to grow my own vegetables!"

In a naive dollar-cost analysis this is true. But once you actually weigh up raising a vege patch -- bearing the risk of going hungry, doing it poorly because you're new at it, spending the hundreds of hours of labour it will require -- letting a professional farmer do it is much smarter.

Gains from trade are gains.

I have a pretty strong opinion on this given how much time I've already sunk into similar work: http://chester.id.au/2012/06/27/a-not-sobrief-aside-on-reign...

And for my secondary startup I am thinking I will just let Heroku handle it.


[deleted]


The blog post doesn't explicitly mention the performance problems that you are now saying prompted your move.

Then you go on to say it was simple and talk about how much cheaper it is.

I feel like I was replying to what you said in the first instance, not the more interesting underlying cause.

Did you talk to Heroku about your performance problems? I'd be interested to see how much leeway they would grant.

Edit: for people wondering why the comment was deleted, I think because zeeg accidentally replied to me instead of a different poster; nothing silly going on.


I talked to them (the people I knew) some, but it mostly came down to characteristics of the app.

So various events were like this:

* Perf problem w/ the code (e.g. didn't handle this kind of spike)

* Perf problem with the service (e.g. had the $200 db instead of the $400 one)

* Couldn't max CPU due to lack of memory

* Couldn't max CPU due to IO issues (db)

* Couldn't maintain a reasonable queue (had to use RedisToGo, which is far from cheap)

The biggest one I couldn't get around was that my queue workers required too much memory to operate (likely because of them dealing with larger JSON loads). Too much was something like 600mb total on the Dyno (not just from the process). I routinely saw "using 200% of memory" etc. in the Heroku logs, and that's when things would start going downhill.

Things could have been a lot better if I had more insight into the capacity/usage on Dynos (without something like NewRelic, which doesn't expose it well enough).

A great analogy for me is this:

If SQL isn't scaling there are several options:

1. Stop using it (switch to another DB)

2. Shard it

3. Buy better hardware

Guess which one we always go to first? :)


Thanks, this makes more sense to me.

Looks like I misread your purpose.


Speaking of backups, you might want to run wal-e:

https://github.com/heroku/wal-e

It is not the best program I ever principally authored (since I know you are a very discriminating Python programmer), but it does work very well and has an excellent reliability record so far, seemingly both for Heroku and for other users of wal-e, per most reports. I also tried quite hard -- and in this I'm pretty satisfied given both the feedback I've gotten and my personal experiences -- to make the program easy to use and administer, from one server and one operator up to many servers and one operator.

If someone is not doing continuous archiving and can't use our service for any reason, we do try to urge them to use something like wal-e or any of the other continuous archiving options. And to that end, I tried to make a pretty credible wal-e set-up only take a few minutes for someone who already knows how to install Python programs (I would love for someone to contribute more credible packaging).
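For anyone curious what that looks like in practice, here's a rough sketch of a minimal setup (the bucket, paths and data directory below are placeholders; the README has the authoritative instructions):

    # 1. Credentials live as one file per variable, read via envdir:
    #      /etc/wal-e.d/env/AWS_ACCESS_KEY_ID
    #      /etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
    #      /etc/wal-e.d/env/WALE_S3_PREFIX    (e.g. s3://some-backup-bucket/pg)

    # 2. In postgresql.conf, ship each WAL segment as it's finished:
    #      archive_mode = on
    #      archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
    #      archive_timeout = 60

    # 3. Take a base backup periodically (e.g. nightly from cron):
    envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.1/main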


Actually before you embark on such things, you should probably be "the scarred person who knows how to fix shit that goes wrong". Then you know which things are likely to hurt you and you can avoid them.

To be in control of your venture means knowing all the corner cases.

I'd trust dedicated hardware more than Heroku as when it does go wrong, you're at the mercy of yourself and not others (other than the colo facility).


"On the limited hardware I run for getsentry.com, that is, two servers that actually service requests (one database, one app), we’ve serviced around 25 million requests since August 1st, doing anywhere from 500k to 2 million in a single day. That isn’t that much traffic, but what’s important is it services those requests very quickly, and is using very little of the resources that are dedicated to it. In the end, this means that Sentry’s revenue will grow much more quickly than it’s monthly bill will."

There's nothing in this justification that doesn't also apply to Heroku. The cost structures just aren't significantly different at a two machine scale. However, as people keep pointing out, the roll-your-own-cloud approach requires that you build and maintain a bunch of infrastructure that Heroku has already built for you, or you forego redundancy and fault tolerance that Heroku has already built for you.

The best lines of code are the ones you don't have to write.


I didn't note it in this post (but did in others): before I switched, my Heroku bill was almost $700 (and I couldn't get it to perform well); the current bill is far less, even with growth.


Yeah, I read your original post. I've worked pretty extensively with Heroku and with large, custom, in-house infrastructure, and I don't share your experience.

There's an I/O penalty for working on AWS, but it's on the order of tens of percent, not hundreds. I suspect that your original problems were related to working set size relative to cache (since Ronin => Fugu bumps cache by over 2GB, and you said that Fugu was working well).

Heroku's largest database has a 68GB cache at an (admittedly expensive) $6,400 a month. But even so, $6,400 is a small expense for a growing web application. A mediocre developer costs more than that. Trading off server cost for developer cost is an asymptotically bad bet.
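If you want a rough read on whether the working set fits in cache, Postgres will tell you its buffer hit ratio directly. A sketch (assuming psycopg2 and a direct connection; the DSN is a placeholder):

    import psycopg2

    # Placeholder DSN -- point this at your own database.
    conn = psycopg2.connect("dbname=sentry")
    cur = conn.cursor()
    # Fraction of heap block reads served from shared buffers. A ratio well
    # below ~0.99 on a read-heavy app suggests the working set exceeds cache.
    cur.execute("""
        SELECT sum(heap_blks_hit)::float
               / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0)
        FROM pg_statio_user_tables
    """)
    print("buffer hit ratio: %s" % cur.fetchone()[0])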


Actually I think I/O was my primary bottleneck. Once I addressed that I started hitting CPU/memory constraints on Dynos.

The database definitely wasn't unreasonable at $400, but for a bootstrapped project (especially something that's a side project for me), that was a big consideration.

I probably would have toughed it out with Heroku if I could have gotten things to perform better. At one point I was running 20 dynos trying to get enough CPU for worker tasks to actually keep up, and I unfortunately couldn't solve the bottleneck to where the cost was reasonable.

The application isn't typical (what Sentry does pushes some boundaries of SQL storage for starters), but it was costing too much of my time to struggle with optimizing something that really shouldn't have needed that much effort.

I definitely like the redundancy provided, and the ability to add application servers with zero thought is a huge plus; I just couldn't justify the cost of the service on top of my frustration and the time spent trying to scale it on Dynos.


"Actually I think I/O was my primary bottleneck. Once I addressed that I started hitting CPU/memory constraints on Dynos....At one point I was running 20 dynos trying to get enough CPU for worker tasks to actually keep up"

Something doesn't make sense. If you "addressed" your I/O problem, your CPUs were therefore all busy doing something much, much slower than a disk read/write, in software (which would have to be both obvious and unbelievably horrible). If that's true, something pathological was going on in your code. I'm going to assume that you would have noticed it -- swapping, for example.

So let's go back to I/O: if your database was slow, you might observe something superficially similar to what you've described: throwing lots of extra CPUs at the problem would result in lots of blocked request threads, and appear that your dynos were all pegged. The exact symptoms would depend on your database connection code, and your monitoring tools. But in no case would throwing more dynos at a slow database make sense, so I'm going to assume that you didn't do that on purpose (right?)

Given the above, I still can't meet you at the conclusion that abandoning Heroku was the magic bullet for your problems. There's not enough information, and it doesn't add up. My money is on one or more of the following: DB cache misses (i.e. not enough cache); a heavy DB write load; frequent, small writes to an indexed table; or pathological memory usage on your web nodes. And if it turns out that the cause is due to I/O, you've only bought yourself a temporary respite by moving off Heroku. Eventually, you'll get big enough that the problem will re-emerge, even though your homebuilt servers are 10% faster (or whatever).

EDIT: Aha! Your comment in another thread actually explains your problem: you were swapping your web nodes by using more than 500MB RAM (http://news.ycombinator.com/item?id=4458657).
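For what it's worth, once you're on boxes you can shell into, the swapping-vs-I/O question takes seconds to answer with stock tools. A sketch, nothing Heroku-specific:

    # Non-zero si/so columns mean the box is swapping; a high "wa" column
    # means CPUs are idle waiting on disk rather than doing work.
    vmstat 1

    # Per-device view: %util pinned near 100 plus high await points at a
    # saturated disk rather than a CPU-bound app.
    iostat -x 1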


It would take me more than one blog post to describe the architecture that powers Sentry and the various areas that have had, or can have, bottlenecks (some more obvious than others).

More importantly, it's been a few months since I made the switch and I don't remember the specifics of the order of events. I can assure you though that I know a little something about a little something, and I wasn't imagining problems.

(Replied to the wrong post originally, I fail at HN)

http://news.ycombinator.com/item?id=4458643


"I can assure you though that I know a little something about something, and I wasn't imagining problems."

Since you've made it clear in another thread that you were actually running out of RAM on your dynos, I imagine you were running into trouble. There's no need to be snide about it.

Bottom line: you hit an arbitrary limit in the platform. If Heroku had high-memory dynos, the calculus would be different. In the future, instead of arguing that your homebrew system is better than "the cloud", you could just present the actual justification for your choice.


"There's an I/O penalty for working on AWS, but it's on the order of tens of percent, not hundreds."

That's rather optimistic.

The EC2 ephemeral disks normally clock in at 6-7ms latency, that's >13x slower than dedicated disks.

EBS clocks in at 70-200ms latency, that's >5x slower than a dedicated SAN.

And that is under optimal conditions. In reality the I/O performance on EC2 frequently degrades by orders of magnitude for long periods of time.


"The EC2 ephemeral disks normally clock in at 6-7ms latency, that's >13x slower than dedicated disks."

Um. Are you comparing hard drives to SSDs? Rotational latency for a 15k drive is a couple of milliseconds. Seek time for server drives varies from 3-10ms.

EC2 disks are slow, but there's no way they're 13 fold slower than your average server drives. And 6-7ms is just about on par with commodity hardware.


Are you comparing hard drives to SSDs?

No, but I mistyped the numbers. That was supposed to read: 60-70ms.


I'm okay with the concept of trading cost/time for convenience, and why that might work for some folks, but even before we get to that argument, my experience with Heroku is that it just isn't reliable enough for client sites. In the few months we've been reviewing it, we've had a couple of instances of real downtime (i.e. greater than 30 mins) and a few spots of smaller outages (a few minutes each). We don't get that from our Linode+Puppet sites (I taught myself Puppet as we went along; I'm a dev rather than an ops guy really).


Well, I guess you are lucky then, because Linode has had massive downtime in their London and Dallas data centres in the last few days, and when I was a customer their Fremont data centre went down all the time.

http://status.linode.com

Heroku at least runs on EC2, which will be far more reliable over time.


Not sure the London one was really "massive". I have 12 Linodes in London, all on different machines; one became unavailable during the downtime a few days ago, and it was resolved very quickly.


Heroku goes down every time US-East goes down, so it isn't true that its reliability is the same as EC2's in general.


This stuff is really useful even if you don't want to move off Heroku. Just knowing that you can is really reassuring in case they stop offering what you need. Simple how-to guides are even better.

It was the fact that Heroku offers an environment for running fairly standard Rails + Postgres that made me pick them over Google's platform, which is more unique and harder to move away from. Even though I was starting from scratch.

It's always good to have an exit route.


A bit off topic but this is a good thread to ask in:

Those of you running startups that don't colocate your own hardware, and don't run in a cloud, where do you rent servers from these days?

Most of my stuff is at Softlayer, but their RAM pricing is killer ($25/mo/GB).


It might not work for you if you're US-based (and want your servers in a datacentre you can easily fly/drive to), but I hear a lot of good things about www.hetzner.de and the prices are fantastic. I plan to use them for the venture I'm currently working on.


Hetzner.de is awesome. They are cheap and reliable, which is a combination you don't see often.


Unfortunately, being in Germany and using desktop-grade hardware means high latency and (comparatively) high failure rates. That's why it's cheap.


Germany isn't badly connected and any extra latency can be largely mitigated by using a CDN for your static assets.

As for desktop-grade hardware - I haven't seen this mentioned by anyone else. Any links to back it up?

If that's true (and I doubt it is), it's avoidable if you use their extremely well priced colo option (which is what I'm going for).


> are you referring to the latency

Yes, "high latency" was referring to the latency. A CDN for static assets doesn't mitigate the fact that the initial request, and everything dynamic is on the wrong side of the planet for most startups' users.

> Any links to back it up?

Hetzner.de. They list the hardware in the boxes. It's desktop processors, desktop motherboards, desktop hard drives and non-ECC RAM in all the cheap server lines.

http://www.hetzner.de/en/hosting/produktmatrix/rootserver-pr...

7200RPM hard drives, Core/Athlon processors and non-ECC RAM don't belong in servers.

http://arstechnica.com/business/2009/10/dram-study-turns-ass...

According to Google's study, you would expect about 2 memory errors per day per server running 24/7. You need ECC RAM.

> I doubt it is

It's rude to publicly call someone a liar without evidence.


I think you're overblowing the latency problem. European users interact with US-hosted startups all day long, and it's not like you hear us complaining.

Re: hardware. The choice is nice to have, and their EX6 and up packages are "proper" server grade (as in Xeon and ECC). EX6 machines start at EUR 69, which is fantastic if you ask me.

Edit: as for rudeness - chalk it up more to a disagreement over what constitutes "desktop-grade hardware" (and your omission of their higher-level hardware offerings).


SL has become very expensive in the last couple of years (we host primarily there at Disqus).

For Sentry I'm using Incero, but will likely be switching to Hivelocity (or something similar), as I currently don't have an internal network and that's a big, annoying deal for me.

Also, if you're looking for deals or more information on hosts: webhostingtalk.com


Funny that we'd run into each other here again - Hivelocity is super cheap. I still have a massive single server there and have used them for ~3 years now; in fact, all those servers I replaced were with them! They usually have really good specials, incredible value. But their live support is useless; the best thing they can do is log tickets on your behalf for the real support. Most of the time that doesn't matter.

SoftLayer have much better support, geographically diverse locations and a wicked suite of offerings. They're also giving startups $1000/month credit for 12 months. I got accepted almost instantly... I was still asking questions and then the rep was all "I made you an account, here's the details".

https://www.softlayer.com/partners/catalyst


I wasn't aware of that credit, that's actually really nice!

I used to have a really good deal with SL about 2-3 years ago, and when I was looking around for Sentry they were my first stop. Unfortunately the prices had gone up a lot while I was gone, and for a reasonable DB server it was looking to be pretty pricey.


How many nodes did you say you were going to use, to get accepted so fast? I've been using Rackspace Cloud and AWS for small demo deploys, but we are still shopping around for a cloud provider.


It may have helped that we were already on ~12 servers, but I don't think they're discriminating; they have terms like unused credit not rolling over, so they can't be expecting everyone to use it all.


Softlayer will drop their RAM price to $12.50/mo/GB just for asking. Seriously, just ring up and they will offer it to you in a heartbeat.

After a bit more chatting with a sales guy I managed to get the price per gig of RAM well below that.

Softlayer is a bit expensive out of the box, but I've found them very open to negotiation.


I started with two basic Linode VPSes. I keep them just because their customer support is great; they even let me stay a month without paying when I had money problems. Also, all inbound traffic is free (which is basically what these two machines handle).

Now, from UP2VPS[1] you can get a pretty good deal (1GB memory, 1TB transfer) for just $6/month.

[1]: http://up2vps.com/


Their customer service is "dreadful". The support guys are as unhelpful as can be.

But if you don't need support, their service is very affordable and good.


"DNS being slow (fuck it, use IPs)"

So, yeah, ops isn't hard at all, if you don't fucking take the time to do it right.


"DNS being slow (fuck it, use IPs)" GAAAHH!! <table flip>. The 1980s called, they said youre an f'n idiot. Seriously, this is not "operations". This is "phoning it in" and/or "not understanding the problem".

PS: to the OP, not the parent.


The problem is DNS resolution isn't cached and potentially has to do a lookup every time it makes a connection. When I was doing that on every request (potentially several times), it became expensive. Sounds like I understand it?

That DNS mapped to one node, which will rarely change. If it does, I can spend 30 seconds and deploy a new config, rather than worry about virtual IPs or anything more complex, let alone having to configure a cache, which would have the exact same problem (delay to change).

Guess what, it's 2012 and the same shit that worked back then works just fine now.

There's a fine line between doing things right, and doing things just to do them.


"The problem is DNS resolution isnt cached" I don't get it. It's like using http withou content-length or content-md5 and complaining about a lack of data integrity. A DNS cache is trivial, either in your code or a local service. Or hell just use your local gethostbyname(). Your os will probably cache for you. And if it's not you'll pay 0.1ms to hit another box in your dc. But if 30ms every few minutes is still too much use a pref etching resolver like (lib)unbound.

I suppose my real objection is in not Doing The Right Thing. Yes, it may work for your current two box static setup. But small choices contribute a lot of debt that you or someone else has to pay off later. Would be a shame if someone took the message that DNS was just "overhead".


It takes just as much time for DNS to propagate as it does for me to deploy a config change. It literally doesn't matter at this stage.


Run DNSMasq locally (as in, same datacenter as the computers that will be using it) and tell it to cache. It's dead-simple to set up. Then point your computers to resolve using it.

You can even add entries to /etc/hosts and the computers using it as their DNS will resolve them. Depending on how much control you have, DNSMasq will also function as a DHCP server and TFTP server from which you can netboot other servers and do such nifty things as automatic reinstalls. Useful if you have a separate, internal network and want to set internal IPs, too.
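A caching-only config is tiny. A sketch (the interface and IPs are placeholders, and this leaves the DHCP/TFTP features off):

    # /etc/dnsmasq.conf -- cache-only resolver for the rack
    listen-address=10.0.0.2     # private IP the other servers will point at
    cache-size=1000             # number of entries to keep cached
    no-dhcp-interface=eth0      # DNS only on this interface
    # Anything you add to /etc/hosts on this box is served to every client.

    # On each app server, point the resolver at it:
    #   echo "nameserver 10.0.0.2" > /etc/resolv.conf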


DNSMasq is nice; it's so easy to make up your own local DNS names. Do you run several to avoid a single point of failure, or do you just fall back to the "real" DNS?

Even with a local DNS server there has to be some overhead, though... OTOH, avoid premature optimization, etc.


Sounds to me like the main issue was running out of RAM on the workers. Would this not be solved by moving to another cloud provider (such as AWS) where you are not limited to the tiny RAM provided on Heroku?

Is this not similar in strategy to choosing hardware and spinning up your own stack?

EDIT: I'm getting at the rather sweeping statement against all cloud providers based on a specific Heroku problem.


The moment you do this, you forego one of the biggest advantages of using Heroku. As long as you're on Heroku only, you don't need to take care of securing the underlying stack - that is, firewall rules, OS updates, general maintenance. The moment you spin up a single AWS instance alongside it, it's your problem. Depending on your use case it could be a better choice to just go all the way to dedicated hardware: the primary advantage AWS has over dedicated hardware is flexibility. You can spin up instances depending on your current need. If your load behavior is a flat, predictable curve, you might just not need that - and then real hardware is cheaper in most cases.


I had the same thought: why not run the workers on a few large-memory EC2 instances and use Heroku for everything else? Run in the US East region for free bandwidth between Heroku and the EC2 instances.


Not directly relevant to the blog post, but Sentry is an amazing piece of software. We're running it on an m1.small AWS instance along with a bunch of other stuff and it is rock solid.


My company is self-hosting Sentry for about 30 Python & Java in-house services, and it was the easiest deployment of anything that's taken me more than an apt-get install to deploy.





