

Heroku Dynos Unable to Start - mparramon
https://status.heroku.com/incidents/642

======
sparkman55
I know it's popular to be a "developer-centric" organization these days, but
please, please, don't schedule risky maintenance operations for 10 AM on a
Monday, when your tools are in heavy usage.

Every SaaS product I've built has analyzed traffic, and performed
migrations/deployments at off-hours (generally, after midnight in our dominant
time zone). In this case, the outage may have resulted in hundreds
(thousands?) of paged administrators across the world, but at least fewer end-
users would have been affected.
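
As a rough sketch of that kind of traffic analysis (not anything Heroku is known to run): given request timestamps already converted to the dominant time zone, pick the hour of day with the fewest requests. The function name `quietest_hour` and the input shape are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def quietest_hour(timestamps):
    """Return the hour of day (0-23) with the fewest requests.

    `timestamps` is an iterable of datetime objects, assumed to be
    expressed in the dominant time zone of the user base already.
    """
    counts = Counter(ts.hour for ts in timestamps)
    # Hours that never appear in the log count as zero requests.
    # On a tie, the earliest hour wins.
    return min(range(24), key=lambda hour: counts.get(hour, 0))
```

In practice you would feed this several weeks of logs and look at weekdays separately, since a Monday morning and a Sunday morning rarely look alike.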

Also, at scale, it's a good idea to deploy to a single cluster/zone first, and
check error rates before deploying to the larger environment.
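
A minimal sketch of that staged rollout, assuming hypothetical `deploy` and `error_rate` callables that you would supply yourself (they stand in for whatever deployment and monitoring tooling is actually in use), and an arbitrary 1% error threshold:

```python
def staged_rollout(zones, deploy, error_rate, max_error_rate=0.01):
    """Deploy to the first zone, check its error rate, then continue.

    `deploy(zone)` and `error_rate(zone)` are hypothetical hooks.
    Returns the list of zones that were actually deployed to.
    """
    if not zones:
        return []
    canary, rest = zones[0], zones[1:]
    deploy(canary)
    deployed = [canary]
    if error_rate(canary) > max_error_rate:
        # The canary zone looks unhealthy: stop before touching the rest.
        return deployed
    for zone in rest:
        deploy(zone)
        deployed.append(zone)
    return deployed
```

A real rollout would also wait long enough for the error rate to be meaningful and roll the canary back on failure; both are left out here to keep the idea visible.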

It's pretty scary that the 'professionals' to whom I've trusted my business
aren't more savvy when it comes to high availability...

~~~
kawsper
Are you sure that Heroku is peaking at 10 AM on a Monday?

I believe they have off-hours, but as an international product it could be
interesting to know when. I know a lot of European products with European
users that all depend on Heroku.

~~~
zzen
They have a separate EU hosting region that has separate maintenance (and
uptime that's actually significantly better than the US region's):
[https://status.heroku.com/uptime?region=EU](https://status.heroku.com/uptime?region=EU)

So yeah, it was a lame time to schedule maintenance.

~~~
alrs
I disagree, strongly.

If you've architected for IAAS you have sufficient redundancy and a plan for
graceful degradation. I'd rather be dealing with an outage after lunch than at
4am.

~~~
sparkman55
This is exactly the "developer-centric" mindset that is, frankly, misguided.

Sure, I would rather deal with an outage during business hours, but that means
that many of my _Customers_ are also dealing with an outage. If the outage
were at 4 AM, most of my _Customers_ would be asleep, and wouldn't even notice
the outage.

~~~
alrs
If you want NASA reliability, build a NASA-grade 24/7 operation across three
timezones.

If your plan is just to have the nerds skip sleep periodically, you're an
asshole.

~~~
sparkman55
Exactly! Heroku should be building that 24/7 operation so I don't have to.
Isn't that the whole point of PaaS?

~~~
alrs
I'm sure that's the plan. But if you _really_ _really_ _really_ need faultless
24/7 you're looking for something akin to "NASA as a service", and I assume
that such a service wouldn't be able to invoice using something as pedestrian
as a credit card.

------
pardner
Compounding the issue of irresponsibly scheduling maintenance for US zones
mid-day in the US, they ALSO broke the ability to scale worker processes to zero,
so there was literally NO way to wind your app down and prevent worker jobs
from firing off.

When the platform started getting wonky we shut off our worker dynos so they
would not fire off emails while the platform was in a known-screwed-up
state.

The console said workers were set to zero.

But... in our logs we watched the (supposedly off) workers continue to fire
off.

Nice FUBAR all around, Heroku.

------
sprite
Started getting tons of emails from end users. Went to investigate and was
greeted with this:

!
! Heroku has temporarily disabled this feature, please try again shortly.
! See [http://status.heroku.com](http://status.heroku.com) for current Heroku platform status.

Hopefully they will be back up soon. Also wonder if we will get any sort of
reimbursement? I currently spend around $3k/month with Heroku.

------
silasb
Makes zero sense to do a 2-hour update on a Monday at 10 AM PDT.

Horrible horrible timing. I'll likely be getting blamed for this since I
recommended Heroku.

~~~
alrs
Do you want to start working night shifts? The people who know how to do this
stuff command six-figure salaries and are highly in demand.

Every startup is looking for the design/dev/ops unicorn. You want to start
trying to find nocturnal unicorns?

~~~
stevepike
These don't need to be night shifts. Back when I worked in bigco, we had very
competent ops teams around the world - we'd even send developers abroad for
3-12 month periods to train with them and share information.

Smaller companies use Heroku so we don't have to build the same expertise in
house. They're able to charge a premium not because their developers have to
work crazy hours, but because there's a certain base cost to having these
kinds of distributed teams.

------
guywithabike
This was caused by their scheduled maintenance:
[https://status.heroku.com/incidents/641](https://status.heroku.com/incidents/641)

I think they jinxed themselves: "Running apps will not be affected."

------
sprite
Seems everything is returning to normal now. Here is a New Relic screenshot
from the outage:
[http://i.imgur.com/WB7U2mz.png](http://i.imgur.com/WB7U2mz.png)

------
shravan
Request queueing on New Relic is going haywire on our app right now.

~~~
maxisnow
We had a similar problem on our app, but were able to restart with more dynos.

