
Faktory, a new background job system - mperham
http://www.mikeperham.com/2017/10/24/introducing-faktory/
======
jitl
It seems nuts to me to introduce a stateful queue system like this where the
persistence story is “data’s out there on a single node in some kind of disk
format, hope it doesn’t go missing!”

I guess this style significantly reduces setup friction, but it’s almost an
irresponsible design in a universe where the average cloud provider is telling
you, “your compute instances may vanish at any time.” If this used Kafka,
Redis, MySQL, or another well-known stand-alone data store, I know my data is
replicated, and I can recover my enqueued jobs if Something Bad happens.

I like the nice UI and the simple API, but there’s no way this will replace
Resque or any Kafka-based-job-whatever at my place of work.

~~~
mperham
It supports backup and restore today, but RocksDB doesn't have a replication
story at the moment.

Side note: Resque will lose jobs if it crashes; it doesn't use RPOPLPUSH. I
hope you'll take another look at Faktory in a few months, maybe we'll have
addressed your issues.

~~~
koolba
Using RPOPLPUSH there's no guarantee you won't lose the job if the Redis
server crashes prior to publishing it. Even with persistence enabled (AOF)
there's still a small window of time between receiving the message and the
next AOF fsync. The sender would think the message is queued, since it got an
OK response back, but the message would be lost.

There are three common solutions to this:

First is to pretend it doesn't exist (quite common!).

Second is to understand that losing messages is a possibility and only use it
for message types where the loss of a message wouldn't be critical (which
covers quite a bit).

The third solution, which actually addresses the problem, separates the
persistence of message details to a transactional store from the event
notification of a new message. A "sweeper" type task can then check the
persisted message list for messages that haven't been processed and re-publish
them.
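A minimal sketch of that pattern, using SQLite to stand in for the
transactional store and a plain list as the lossy notification channel. All
names here are illustrative, not any particular library's API:

```python
# "Transactional outbox + sweeper": persist the message in a transaction,
# notify best-effort, and let a sweeper re-publish anything unprocessed.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, body TEXT, done INTEGER DEFAULT 0)")

notifications = []  # stand-in for a fire-and-forget pub/sub channel
handled = []

def handle(body):
    handled.append(body)

def enqueue(body):
    # 1. Persist transactionally; this is the source of truth.
    with db:
        cur = db.execute("INSERT INTO outbox (body) VALUES (?)", (body,))
    # 2. Best-effort notification; losing it is fine, the sweeper recovers.
    notifications.append(cur.lastrowid)

def process(msg_id):
    row = db.execute("SELECT body FROM outbox WHERE id = ? AND done = 0", (msg_id,)).fetchone()
    if row is None:
        return  # already handled; processing must be idempotent
    handle(row[0])
    with db:
        db.execute("UPDATE outbox SET done = 1 WHERE id = ?", (msg_id,))

def sweep():
    # Re-publish anything persisted but never processed
    # (e.g. the notification was lost in a crash).
    for (msg_id,) in db.execute("SELECT id FROM outbox WHERE done = 0"):
        notifications.append(msg_id)

enqueue("send-email")
notifications.clear()       # simulate the notification being lost
sweep()                     # sweeper finds the unprocessed row
process(notifications.pop())
```

A production sweeper would also track a claimed-at timestamp so it doesn't
re-publish messages that are merely in flight.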

( _Disclaimer: I'm a huge fan of Redis and think it's the bee's knees of data
structure servers._)

~~~
elvinyung
You mean RDB? I thought AOF can be configured to fsync at every write.

~~~
koolba
The default is to fsync once per second so the failure window is small but
it's not zero.

It's possible to fsync on every write[1] but it may be too slow, as the
single-threaded nature of Redis means every write operation is serialized.
Plus that'd apply to all usage of that Redis server, not just the queue.

[1]: https://redis.io/topics/persistence#how-durable-is-the-append-only-file

------
manigandham
You can just use Redis, Kafka, or better yet use Google's Cloud Pub/Sub or
Azure Service Bus for an extremely cheap and highly-reliable system that
already has all of these features included.

What exactly is the use-case for an entirely new and separate system that
looks like it's single-node only?

~~~
karmajunkie
I had a similar question, given that I'm happily processing jobs in Elixir
that are enqueued by a Ruby app via Sidekiq. After taking a look at the
READMEs, it looks like what this gets you is dumb clients, with the logic
around retries, etc. built into the server, rather than forcing the client to
handle these details. That's great for polyglot systems—the dumber the client
the better, in my book. As another commenter posted, I'll be interested to see
more about what tradeoffs it makes in distribution.

If your current system is already something like SNS/SQS or an actual message
queue with acks, then this probably isn't aimed at you.

~~~
manigandham
Why run something extra when you can use a single database table which is
already universally accessible by any dumb client?

Push, fetch, ack = insert, select, delete. A few lines of SQL or a stored
procedure gets it done. Postgres basically made _skip locked_ for queues.
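A minimal sketch of that in Postgres 9.5+ (the table and payload shape are
illustrative):

```sql
-- push = INSERT
CREATE TABLE jobs (id BIGSERIAL PRIMARY KEY, payload JSONB NOT NULL);
INSERT INTO jobs (payload) VALUES ('{"type": "send_email", "to": 42}');

-- fetch + ack in one transaction per worker
BEGIN;
DELETE FROM jobs
WHERE id = (SELECT id FROM jobs ORDER BY id
            FOR UPDATE SKIP LOCKED LIMIT 1)
RETURNING payload;
-- ... run the job with the returned payload ...
COMMIT;  -- commit deletes the row; ROLLBACK (or a crash) requeues it
```

SKIP LOCKED lets concurrent workers each grab a different row instead of
blocking on, or double-fetching, the same one.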

~~~
karmajunkie
It's a valid way to go, but does skip locked handle retries for you? There's
more to a queue than just the message.

Also worth noting: Postgres is another piece of infrastructure to run, and
not all applications use it.

~~~
manigandham
The whole point is that Faktory is another piece of infrastructure, with
arguably little value when you already have all of the features within Redis
or a SQL database which you're probably running anyway. Sidekiq itself
requires Redis.

For Postgres, _SKIP LOCKED_ is the easiest way to queue because the row only
gets deleted if the transaction commits, which is when the job finishes.
Otherwise, if there's an error, the transaction doesn't commit (or the worker
just crashes), the row remains in the queue, and another worker process tries
it again.

~~~
karmajunkie
With Redis you're back to the original problem of needing to code things like
retries back into your client. Postgres may have some limited ability to serve
as a message queue, but you're still going to be missing some of the other
features that will end up in Faktory, like periodic jobs.

If pg works for your situation, that's great, but I don't know why you'd
assume that none of these objections have come up before and been responded
to.

~~~
manigandham
If you want the client-side retry logic too, then that requires
language-specific libraries, and it seems that's what Faktory requires anyway,
so this whole thing becomes unnecessary.

They could've just ported the Sidekiq library to several languages, or used a
single C library with language wrappers, and gotten the same result with less
work.

------
rubiquity
When you peel back a layer or two from the background job onion you land smack
dab in the land of message queues. I’ll be curious to see what trade offs this
library makes around classic message queue problems such as delivery
semantics, visibility windows, sharding or replication for multi node, etc.

There’s definitely a lot of message queues out there these days and things
like Kafka which turn into message queues. That said, a lot of them are a pain
to operate so there’s room for one that is easy to deploy and operate.

------
sandGorgon
> _It uses Facebook 's high-performance RocksDB embedded datastore internally
> to persist all job data, queues, error state, etc_

This breaks the devops story for me. The difference in feature set between
RocksDB and Redis is not that big. However, Redis is hugely supported in the
cloud and in high-availability fully-managed mode.

It's so convenient to use Sidekiq on Heroku or AWS. I really hope you build
this on Redis rather than a new persistence server.

~~~
deedubaya
Correct me if I'm wrong, but RocksDB is embedded within Faktory, so this
would actually be one less service to manage than using an external Redis
server. Isn't that a good thing?

From my experience running high-throughput, quickly-executing Sidekiq workers
on Heroku, the expense often doesn't come from dynos but from the Redis
instance, as the limiting scale factor usually comes down to connections. That
won't be a problem with Faktory and an embedded datastore.

~~~
wilde
It depends on how comfortable you are developing a monitoring and failover
system for something stateful. Right now AWS handles failover of our redis
instances. With Faktory, I’m not sure what that looks like yet.

~~~
deedubaya
Indeed. It will be interesting to see how it handles failover in the case of
an outage.

------
deedubaya
When Mike released Sidekiq for Crystal-lang, I thought this might be in the
works. From the readme:

> The Ruby and Crystal versions of Sidekiq must remain data compatible in
> Redis. Both versions should be able to create and process jobs from each
> other. Their APIs are not and should not be identical but rather idiomatic
> to their respective languages.

It makes sense to unify the protocol for background job processing, making
them language agnostic. Some languages tackle different problems better than
others, so this will be a really useful tool.

Great work, Mike. Keep the hits rolling.

------
languagehacker
Mike makes some amazing software in this space. I have some mostly pragmatic
concerns about building out a framework that requires some specialized server
to handle state for workers, though.

Maybe for something like RabbitMQ or SQS, this would be a satisfactory
replacement, since these seem to be relatively monotasked persistence servers.
So for your traditional Celery+RabbitMQ deployment, for instance, this could
be a good replacement.

But let's consider cases like Redis, Memcached, or Kafka, where the
persistence store we're using is often also being utilized as a cache or
linear log in other aspects of the same product. This would make Faktory
troublesome, because it introduces additional maintenance costs compared to a
service that we already need. Furthermore, if I can use an off-the-shelf,
hosted storage solution for enqueued job definitions, like we do with
Elasticache, I reduce my operational costs even more.

So it's not a question of whether or not Faktory works, but whether it's worth
the cost of building, deploying, and maintaining a specialized monotasker
server instance on top of the worker pool I already need to build and
maintain. I'd be interested in understanding where the long-term value add
would be in most large-scale practical SOAs, and how folks excited about this
project anticipate implementation might go.

------
mperham
For those of you pointing out flaws, remember you are picking apart a pre-1.0,
just-launched project. We're all hackers and startup people, right? Think MVP;
no one can ship a perfect system completely finished.

~~~
manigandham
The only _flaw_ here seems to be a lack of use-case... why not just add an
HTTP API to Sidekiq instead, if it doesn't already have it? That way the value
of the queue logic and semantics can be offered to any external client.

~~~
mperham
Sidekiq is not a server, it's a Ruby worker process. Redis is the server and I
can't build a fast, easy to use embedded queue system on top of Redis.

~~~
manigandham
Can't Sidekiq just run as a standalone process? It could take a connection
string to use an existing Redis/RDBMS, with an option to use an embedded
server. You could just package it with Redis in a container too.

Although to be honest, the basics of a work queue are well covered now with
the evolution of cloud services and other databases and message systems.

~~~
mperham
Sidekiq is Ruby only. That won't support one of my key goals with Faktory:
polyglot. Background jobs can benefit applications written in many languages.

There are many different tools and many different users. No one choice is
appropriate for all. I hope some people find Faktory useful.

------
odammit
I’ve used Sidekiq for years. Great project. Looking forward to giving this a
swing.

I really respect what Mike’s done being able to monetize his work on cool open
source projects.

------
lobster_johnson
This looks nice, but I'm also confused by why anyone would build a single-node
data store in 2017. I can't find any information about whether replication and
HA/failover is planned. Writing a queue on top of RocksDB is arguably trivial;
it's the other stuff that is difficult.

The other thing about job systems that a lot of people seem to ignore is
client-side scaling. We run our apps on Kubernetes, where you'd naturally want
to tune worker scheduling dynamically to accommodate queue size. Feeding custom
queue metrics into the horizontal pod autoscaler is one way to do this.
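As a sketch of that last approach, assuming a metrics adapter already exposes
a `queue_depth` external metric (the metric name and targets here are
illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: queue_depth
      target:
        type: AverageValue
        averageValue: "100"   # aim for ~100 queued jobs per worker pod
```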

~~~
pmontra
Many web applications are so low traffic that a single node is enough to
handle everything: application server, database, queue system, static content,
etc. It's up to the business to decide whether they want to spend the money
on high availability. In my experience almost nobody does, and they've lived
for years with only occasional downtimes that didn't harm their business.

~~~
sscarduzio
Yes, but downtime is different from data loss.

------
continuations
On a related matter, how does a job queue differ from a message queue?

What can a job queue do that Kafka or RabbitMQ cannot do?

~~~
mperham
Background jobs are specialized messages. You can build a decent job system on
top of a message queue but you'll lose the specifics. For instance, Rabbit and
Kafka won't give you the built-in retry system with error tracking and UI.
Faktory enforces a specialized message format (the job payload) and can do
many things with that data; message queues that treat messages as a simple
byte array can't do that.

[https://github.com/contribsys/faktory/wiki/The-Job-Payload](https://github.com/contribsys/faktory/wiki/The-Job-Payload)
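For context, a Faktory job payload is a structured JSON document along these
lines (field set abbreviated; see the linked wiki for the authoritative list):

```json
{
  "jid": "8a2c3f6b1d9e4f70",
  "jobtype": "SendWelcomeEmail",
  "args": [12345],
  "queue": "default",
  "retry": 25
}
```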

------
perfect_kiss
One of the key features I love beanstalkd for is support for assigning
different priorities to jobs in the same queue, so the jobs with higher
priority will be processed first. Looks like this feature is missing from
Faktory.

~~~
mperham
[https://github.com/contribsys/faktory/issues/new](https://github.com/contribsys/faktory/issues/new)

------
nikolay
How does this compare to
[https://github.com/antirez/disque](https://github.com/antirez/disque)?

~~~
manigandham
Disque was never officially released and is considered deprecated now. Redis
already does well as a queue, v4.0 came with modules which add even more
functionality, and future releases will include a new _streams_ datatype,
similar to Kafka.

~~~
nikolay
Thanks for sharing this - I've obviously missed these developments.

------
bryanlarsen
Looks very similar to beanstalkd; the only difference I can see is that
Faktory comes with a bundled GUI. Is there anything else I'm missing?

~~~
aarondf
From the FAQ:

> Faktory aims to be more feature-rich and better supported. Many of Faktory's
> OSS competitors are "dead" and no longer supported. I am fortunate enough to
> have both expertise in background jobs and a business model to support
> Faktory long-term.

~~~
falcolas
Background job systems rarely need extensive support (or, for that matter, a
constant stream of features). I've used Gearman for years, to great success.

Not saying that Gearman is a better solution, just saying that "better
supported" for such a relatively simple tool is not always necessary.

------
sscarduzio
This would be great as a layer on top of SQS and/or GCE PubSub. And I'd host
Faktory in a Lambda or CloudFunction.

