
Too Many Signals – Resque on Heroku - ejlangev
http://eng.joingrouper.com/blog/2014/06/27/too-many-signals-resque-on-heroku/
======
pfg
Am I missing something, or is this solution only going to prevent running jobs
multiple times in case Resque is being shut down in an orderly fashion with
TERM? What if your instance simply dies (which could happen for any number of
reasons)?

Solving this problem can be rather complex whenever third-party services are
involved, but somehow this feels like you've only lowered the likelihood of
multiple job executions, which isn't something I'd be comfortable with when it
comes to things like credit card charges.
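
One common mitigation, as a hedged sketch (not from the post): derive an
idempotency key from your own record id, so a duplicate execution of the
charge becomes a no-op on the provider's side. Stripe's API supports such
keys; the job, queue, and model names here are hypothetical.

    require 'stripe'

    class ChargeOrderJob
      @queue = :payments

      def self.perform(order_id, amount_cents)
        order = Order.find(order_id) # hypothetical model
        # Same order id => same idempotency key => Stripe performs the
        # charge at most once, no matter how often the job is re-run.
        Stripe::Charge.create(
          { amount: amount_cents, currency: 'usd', source: order.card_token },
          { idempotency_key: "charge-order-#{order_id}" }
        )
      end
    end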

~~~
steveklabnik
Said maintainer who hasn't merged this PR yet here. The reason I haven't is
because I looked at this, went, uhhhhhh I'm not sure, and haven't had the time
to figure out if it's a good change yet. I don't want to change signal
handling and then break things for other people.

~~~
ejlangev
Solid point. My purpose here was more to start a discussion around this
problem and what we did about it rather than to say this is the only or best
solution. Also, I was sure other people must be having this issue on Heroku,
but I couldn't find any other solutions that would work despite spending a
good amount of time searching for one.

~~~
steveklabnik
Absolutely! I'm also not saying you're wrong, I'm just saying please don't
take my failures as a maintainer as a signal either way. I appreciate the
patch.

------
zo1
Could someone enlighten me and explain _why_ Heroku sends TERM signals to the
running processes? Doesn't sound very healthy, that's for sure. Nor something
I'd personally tolerate from someone I'm purchasing a "cloud hosting" service
from.

Is it simply the case that this is the way Heroku responds to being told to
shut down an instance? If so, why isn't the managing app that sends the
shutdown call to the instance also handling the graceful "shutdown" of the
processes on that instance?

~~~
davetron5000
Heroku sends them as a normal course of operations. Dynos get cycled daily.
Why? Not sure, but it happens, and it's well-documented by them that it
happens. It's likely impossible to completely insulate against it, but if you
design your jobs to be idempotent and safely retriable, rather than trying to
trap their signals, your jobs will be a lot more bullet-proof.
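
As a hedged sketch of that design (the job, model, and mailer are
hypothetical): record completion in your own datastore and check it before
acting, so a re-run after a KILL is harmless.

    class SendReceiptJob
      @queue = :mailers

      def self.perform(order_id)
        order = Order.find(order_id)
        return if order.receipt_sent? # already done on an earlier attempt

        ReceiptMailer.receipt(order).deliver
        # Note: a kill between deliver and this write can still cause one
        # duplicate email; the flag bounds the damage to a single retry.
        order.update_attributes!(receipt_sent: true)
      end
    end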

~~~
zo1
" _Heroku sends them as a normal course of operations._ " Wow, I didn't
actually know that. Makes me glad I didn't pick Heroku recently for one of my
mini-projects. It requires long-running processes.

I guess Heroku "dynos" are more suited for "worker" type jobs, then. In which
case, sending the TERM signal to all processes isn't necessarily a really bad
way of notifying the worker to shut down. Although, it is 2014, and I don't
see why they can't come up with a more robust solution, even if it's in the
form of a "shut-down" process or giving the worker more than 10s to shut
down.

------
mperham
Author of Sidekiq here.

I sympathize. I've spent a heckuva lot of time getting clean shutdown working
well (and someone just fixed a rare but persistent issue this morning!).
There are a lot of edge cases. Steve and the Resque team are doing the right
thing: you don't want a fix for one edge case to break another, and this
stuff is near impossible to test.

~~~
neodude
Mike, I'm curious if you think Sidekiq suffers from a similar issue on Heroku,
and what the solutions - ideas or already implemented - look like?

~~~
mperham
AFAIK this problem is endemic to any job processing system where jobs can take
more than N seconds to process. What Heroku does:

  * Heroku sends the TERM signal.
  * The process has 10 seconds to exit on its own.
  * After 10 seconds, the KILL signal is sent to terminate the process without notice.

Sidekiq does this:

  * Upon TERM, the job fetcher thread is halted immediately so no more work is started.
  * Sidekiq waits 8 seconds for any busy Processors to finish their jobs.
  * After 8 seconds, Sidekiq::Shutdown is raised on each busy Processor. The corresponding jobs are pushed back to Redis so they can be restarted later. This must be done within 2 seconds.
  * Sidekiq exits or is KILLed.
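
A simplified sketch of that sequence (illustrative only, not Sidekiq's actual
source; fetcher, busy_jobs, and redis are hypothetical stand-ins):

    trap('TERM') do
      fetcher.stop # stop picking up new jobs immediately

      # Give in-flight jobs up to 8 seconds to finish on their own.
      deadline = Time.now + 8
      sleep 0.1 while busy_jobs.any? && Time.now < deadline

      # Push any still-running jobs back onto Redis so they restart later.
      busy_jobs.each do |job|
        redis.lpush("queue:#{job.queue}", job.payload)
      end

      exit(0) # be gone before Heroku's KILL arrives at the 10-second mark
    end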

------
dylanz
Great post, and this is something we've faced as well. Luckily our jobs are
mainly idempotent, and the ones which aren't are not that critical. This is a
pretty nice solution! Ethan, the errors you still see from jobs that take more
than PRE_TERM_TIMEOUT seconds... I'm assuming that's a separate, job-specific
issue, like calls to external services that time out?

I noticed the "wait 5 seconds, and then a KILL signal if it has not quit"
comment in the code above the new_kill_child method. Without jumping into the
code, is the normal process sending a TERM, then forcing a KILL after 5
seconds? Just curious.

~~~
ejlangev
Yeah, it tends to be from unresponsive external web services that crop up
every once in a while. Having a couple of jobs that fail that way isn't the
end of the world for us even if we don't retry them.

Yes, the situation you're describing is the RESQUE_TERM_TIMEOUT option, which
dictates how long the parent process waits to send a KILL signal after it
sends the TERM signal to the child. On Heroku you want that to be less than
10 seconds (and in practice more like 8 at most), otherwise Heroku will
terminate both processes with a KILL signal at the same time.
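
For example, a worker entry in the Procfile along these lines (a sketch with
example values; TERM_CHILD and RESQUE_TERM_TIMEOUT are documented Resque
settings as of 1.22):

    worker: env TERM_CHILD=1 RESQUE_TERM_TIMEOUT=8 bundle exec rake resque:work QUEUE=*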

------
JonnieCache
I'm currently trying to decide if I should implement Resque again or opt for
RabbitMQ instead, due to this and similar compromises that stem from the
Ruby/Redis combination. What would people say are the major differences, the
major pros and cons, between the two systems? I'm dealing with simple,
intermittent, longish-running jobs which seem well suited to Resque, but I
can't shake the feeling that I might be better off with rabbit/0mq/etc.

Obviously resque is closer to a "turnkey" solution and so forth, but what are
the real fundamental differences?

~~~
cheald
RabbitMQ has its own set of durability issues (see the recent Jepsen writeup
on it), but if the data store itself is stable, then it's really very good.

The primary difference you'll notice is that RMQ has an explicit-ack mode. It
will send a message to a client, the client processes it and sends an explicit
ack (message consumed), at which point RMQ will send the next message. The
client can also send a nack (push the job back onto the queue and redeliver
it), and if the connection is dropped without the job being ack'd, then RMQ
will requeue it and send it to another client.
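
A hedged sketch of that ack/nack flow using the Bunny gem (the queue name and
process handler are placeholders):

    require 'bunny'

    conn = Bunny.new # assumes a local RabbitMQ broker
    conn.start
    channel = conn.create_channel
    channel.prefetch(1) # at most one unacked message per consumer
    queue = channel.queue('jobs', durable: true)

    queue.subscribe(manual_ack: true, block: true) do |delivery_info, _props, payload|
      begin
        process(payload) # hypothetical job handler
        channel.ack(delivery_info.delivery_tag) # consumed; RMQ discards it
      rescue StandardError
        # nack with requeue: RMQ pushes it back and redelivers it
        channel.nack(delivery_info.delivery_tag, false, true)
      end
    end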

If you're performing all your state mutations in a transaction or something
similar that rolls back when a worker terminates, then you can avoid losing
jobs and ending up in invalid state even during non-clean shutdowns.

As far as other notable differences go, you can have multi-queue routing (one
message can be routed into multiple queues) and dead letter exchanges (so that
TTL expired messages can be sent to a different queue rather than just being
dropped). There's a lot more to it, as well; as a message queue, I do think
that RMQ is flatly superior to Redis, but Redis has drop-dead simplicity going
for it that is really nice if you don't need the extra features RMQ offers.

------
driverdan
As someone who literally just implemented Resque for our Heroku-hosted app
yesterday and is about to add billing to it, I'd like to know a little more.
What percentage of jobs end up getting killed? Are you flagging those jobs
somehow so that you can rerun them and check if the 3rd party service already
received them?

~~~
cmelbye
He explained this in the post a little bit, but when you deploy to Heroku,
scale down dynos, etc., Resque workers will be killed, and if they're
processing a job, the job will be killed. He also mentioned that they use
resque-retry to retry the jobs that were killed. You just need to trap the
signal and perform cleanup, which is typically something you should be doing.

~~~
davetron5000
Even if Heroku is working 100% normally and your code is working 100%
normally, your jobs will get killed. The workers get SIGTERM'ed once per day
minimum as dynos cycle. The more workers you have in flight on average, the
more you will see this. The best thing to do is make your jobs retriable,
meaning they are idempotent or otherwise can pick up where they left off. Then
use resque-retry to have them automatically retry. That's what we've done,
and now the only failed jobs we get are legit issues and not Heroku.
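
A minimal resque-retry setup along those lines (a sketch; the job and model
are hypothetical, and the body is written to be idempotent):

    require 'resque-retry'

    class SyncAccountJob
      extend Resque::Plugins::Retry
      @queue = :sync

      @retry_limit = 3   # give up after three attempts
      @retry_delay = 60  # wait 60 seconds between attempts

      def self.perform(account_id)
        account = Account.find(account_id)
        return if account.synced? # safe to re-run after a kill
        account.sync!
      end
    end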

------
chrislloyd
Have you considered using something like
[https://github.com/chanks/que/blob/master/README.md](https://github.com/chanks/que/blob/master/README.md)
for critical jobs?

~~~
ejlangev
I haven't seen that particular project before. It would solve the problem for
changes to the local database, but I don't think it's a solution for jobs
that talk to external web services. Unless I'm missing something?

~~~
chanks
Hi, I'm the author of Que. It's true that you can't really completely solve
the idempotence problem for jobs that write to external web services (unless
those web services provide ways for you to check whether you've already
performed a write - see the guide to writing reliable jobs in the /docs
directory), but that's a limitation that'll apply to any queuing system. I'd
definitely say that Que, being transactional and backed by Postgres'
durability guarantees, does give you better tooling for writing reliable jobs
than a Redis-backed queue would in general.
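
A sketch of the transactional property being described, in Que 0.x-style
usage (the model and job here are hypothetical): the job row is inserted in
the same Postgres transaction as the data change, so either both commit or
neither does.

    class ChargeCustomer < Que::Job
      def run(order_id)
        # runs later in a worker process
      end
    end

    ActiveRecord::Base.transaction do
      order = Order.create!(total_cents: 1000)
      ChargeCustomer.enqueue(order.id) # rolled back if the transaction aborts
    end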

I'm happy to answer any questions you or anyone else might have.

------
stevewilhelm
Has anyone successfully replaced Resque with RabbitMQ to solve this type of
issue?

~~~
davetron5000
We use both, and RabbitMQ is not a solution for this problem. Message
handlers/listeners are equally susceptible to this problem on Heroku (or,
generally, to being killed).

RabbitMQ can be configured to not ack messages where an exception was raised,
so if you have a durable store and the code responding to messages is
idempotent/retriable, you are good to go. Such a system can be easily
configured with Resque jobs using resque-retry, so it's mostly down to how
you design your jobs/listeners/message handlers and not the underlying tech.

------
taf2
It sounds like you should not [edit] cannot [/edit] rely on Heroku for things
like long-running background jobs...

------
AznHisoka
"Heroku reserves the right to send TERM signals to any dyno whenever it wants.
"

I stopped reading right there and thought to myself: thank God I didn't
choose Heroku as my service provider. Overpriced and underpredictable.

------
drunkcatsdgaf
heroku strikes again

