Show HN: Workq – Job Server in Go (github.com/iamduo)
185 points by iamduo on Aug 23, 2016 | 59 comments



Workq would be far better if it followed the AMQP (https://www.amqp.org/) or SQS (https://aws.amazon.com/sqs/) specs and terminology. There's already a large body of work on those two protocols. SQS is much simpler than AMQP, so it's a good first step.


At a high level, SQS is simple, but there are a lot of additions AWS has made over time, such as dead letter queues, message attributes, and delayed messages. Some of these settings are at the per-queue level; some are overridable on a per-message basis. Not that AMQP isn't super complex, but SQS has a larger feature-set surface area than a lot of people realize.

There is also a bit of oddness when people try to use the various AWS libraries against non-AWS-hosted endpoints, since the happy path for these libraries is connecting to AWS regions.

I worked on an internal SQS clone in the past (not related to current employer).

From a queue API perspective, I like iron.io's MQ API [0] and Google Cloud PubSub [1].

Disclaimer: I currently work for Nest, an Alphabet company

[0] http://dev.iron.io/mq/3/ [1] https://cloud.google.com/pubsub/reference/rest/


I have no experience with SQS, but there is definitely space for a simpler queue implementation (or even a standard) than AMQP. I think it's needlessly complex for many problem domains.


Just to clarify my own comment, I would want to see attributes like TTL made part of the queue, not part of the job.


Google PubSub has TTLs on each message, with the queue having the default value for new messages. I've found this to work really well in practice.
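
The fallback is simple; a sketch of the idea only (this is not PubSub's actual API):

    package main

    import (
        "fmt"
        "time"
    )

    // Queue carries the default TTL; a zero Message TTL means "inherit".
    type Queue struct{ DefaultTTL time.Duration }
    type Message struct{ TTL time.Duration }

    // effectiveTTL resolves a message's TTL against the queue default.
    func effectiveTTL(q Queue, m Message) time.Duration {
        if m.TTL > 0 {
            return m.TTL
        }
        return q.DefaultTTL
    }

    func main() {
        q := Queue{DefaultTTL: 10 * time.Minute}
        fmt.Println(effectiveTTL(q, Message{}))                 // 10m0s (queue default)
        fmt.Println(effectiveTTL(q, Message{TTL: time.Minute})) // 1m0s (per-message override)
    }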


I wrote one of these, almost exactly, at my last $dayjob; I had to check and see if this was them open-sourcing it. Alas, not. Oh wait, found it; here's another Go task-queue-with-priority-and-stuff daemon: https://github.com/diffeo/go-coordinate


How would you specify a job with multiple parameters? You use "ping" as the example. If I wanted to run "ping -c 20 10.10.10.10", how would I accomplish that? It's not immediately obvious if that is possible.

From going through the source, it looks like the payload is the cmd. Can those be multiple words, or will it read each word as a separate argument?

    handler := s.Router.Handler(cmd.Name)
    reply, err := handler.Exec(cmd)


This is more or less a message queue: one side (the client) sends a message, the other side (the worker) receives and acts on it. But it is structured around the common use case of someone submitting a "job" to be performed (whenever possible, or at a certain time) and workers picking up the job, performing it, and reporting back the outcome.

"ping" here is just a message, not the name of a system command/executable - in this case the worker receives the "ping" request, and just replies with a "pong" message - both sides are code you need to write.

You can encode whatever you want in the payload, json, simple plaintext - it's up to the client and the worker to agree on the meaning.
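
For example, both sides might agree the "ping" job's payload is JSON; a hedged Go sketch, with the actual enqueue/lease calls left out since they depend on your client library:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // PingArgs is the payload contract both client and worker agree on.
    type PingArgs struct {
        Host  string `json:"host"`
        Count int    `json:"count"`
    }

    func main() {
        // Client side: encode the parameters into the job payload.
        payload, _ := json.Marshal(PingArgs{Host: "10.10.10.10", Count: 20})

        // Worker side: decode the payload and act on it, e.g. run
        // exec.Command("ping", "-c", "20", "10.10.10.10") yourself.
        var args PingArgs
        _ = json.Unmarshal(payload, &args)
        fmt.Println(args.Host, args.Count)
    }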


Thank you, that clears up my confusion.


A job in Workq doesn't directly target a system command or anything. The short answer is that a job can be thought of as a function call: the name of the job is the function, and the job's payload is its parameters. This allows a worker to connect to Workq and pick up available jobs enqueued previously by a client. It is up to the worker to decipher the payload and act on it, which may be to run “ping 10.10.10.10” on the worker node.

The ping example was just given so I could enqueue a job named “ping” as a client and respond with a “pong” text result from a worker. Just a silly example :). Real use cases would be background jobs such as sending emails, HTTP downloads, image resizing, etc.


The synchronous job processing is something we had to add to our beanstalkd client, but none of the other enhancements are useful to us.

The biggest limitation of beanstalkd IMO is the fact that robustness features have to be handled by the client. That's why we're considering switching to disque or alternatives. It doesn't look like workq supports this, and it's likely something that should be designed in from the ground up.

But beanstalkd's proven history is a huge mark in its favour that we'd be loath to give up.


Can you clarify the missing robustness features?


failover


Would it be possible to model dependencies between jobs using this? I.e. only run job X if job Y succeeds?

We're building an in house CI system (we have some weird requirements and can't use off the shelf ones) and we'd love to add an entire job graph to this queue and be able to query the state of it.


It is possible, but explicitly through workers. You can have a worker block on the "result" command (it allows for a wait-timeout), wait for the successful completion of Job X, and then enqueue Job Y.

There is no way to define the dependency automatically, but the workers can create any type of workflow, including handling the case where Job X fails.
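
A hedged sketch of that pattern, where Client, WaitResult, and Enqueue are stand-ins for whatever client library you use (not Workq's actual API):

    package main

    import (
        "errors"
        "time"
    )

    // Result is a placeholder for what the "result" command returns.
    type Result struct {
        Success bool
        Payload []byte
    }

    // Client is a placeholder for a real Workq client connection.
    type Client struct{}

    func (c *Client) WaitResult(id string, timeout time.Duration) (Result, error) {
        // Would issue "result" with a wait-timeout for the given job id.
        return Result{}, errors.New("illustration only")
    }

    func (c *Client) Enqueue(name string, payload []byte) error {
        // Would issue an "add" for the dependent job.
        return errors.New("illustration only")
    }

    // chain enqueues Job Y only after Job X completes successfully.
    func chain(c *Client, jobXID string) error {
        res, err := c.WaitResult(jobXID, 60*time.Second)
        if err != nil || !res.Success {
            return err // Job X failed or timed out: skip Job Y
        }
        return c.Enqueue("job-y", res.Payload)
    }

    func main() { _ = chain(&Client{}, "job-x-id") }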


What about graceful restarts? Like, can I kill -USR2 the pid of this and have it stop listening on that port, launch another version of itself (a new binary) that does net.FileListener(f) instead of net.Listen("tcp", url), but keep running until all current jobs are done?


A big yes to this! Signal handling will be covered in a future update. This is critical to Workq, given that it is intended to run as a standalone server; zero-downtime deployment is required. There will be a more in-depth roadmap in the repo soon. Someone earlier asked for that as well.
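
For reference, the pattern the parent describes looks roughly like this in Go; a minimal sketch of the fd-handoff technique, not Workq's planned implementation (the port is illustrative):

    package main

    import (
        "net"
        "os"
        "os/exec"
        "os/signal"
        "syscall"
    )

    func main() {
        var ln net.Listener
        if os.Getenv("GRACEFUL") == "1" {
            // Child: rebuild the listener from fd 3 (the first ExtraFile).
            ln, _ = net.FileListener(os.NewFile(3, "listener"))
        } else {
            ln, _ = net.Listen("tcp", ":9922")
        }

        sig := make(chan os.Signal, 1)
        signal.Notify(sig, syscall.SIGUSR2)
        go func() {
            <-sig
            f, _ := ln.(*net.TCPListener).File() // dup the listening fd for the child
            cmd := exec.Command(os.Args[0])      // launch the new binary
            cmd.Env = append(os.Environ(), "GRACEFUL=1")
            cmd.ExtraFiles = []*os.File{f}
            _ = cmd.Start()
            ln.Close() // stop accepting; drain in-flight jobs, then exit
        }()

        for {
            conn, err := ln.Accept()
            if err != nil {
                return // listener closed during handoff
            }
            go conn.Close() // a real server would serve the connection here
        }
    }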


cool, i just wrote all this go code for paradise if u wanna steal it :) https://github.com/andrewarrow/paradise_ftp/blob/master/serv...


This appears to be similar to beanstalkd but with some nice additions (a scheduled time for a job to run, a job max existence time).

One of the things I always wanted beanstalkd to have was an atomic move-tube command, so you could emulate a state machine using queues and tubes.


That is correct, beanstalkd was an inspiration for this project. I've even credited it here: https://github.com/iamduo/workq#credits. Could you describe the use case for the move-tube command? More specifically, what was it trying to accomplish within the state machine?

Internally, for simplicity, Workq does not have any separate "tubes", however. There is just a job, pinned by its name.


The idea was that you would reserve a job in a tube, do some work, then atomically move that job to another tube upon success, without needing to delete the job and re-create it in the new target tube. The problem is that if you create the job in the new tube before deleting it, the TTL could kick in and return the job to the original tube, meaning you have the same job in both tubes. The other alternative is to delete the job and then put it in the target tube; however, if your worker process dies during this step, you may end up losing the job.


It sounds like what you need is some sort of transaction support so that deletes and adds only happen if both succeed.


Which would make things a lot more complicated - vs. issuing a move command on a job id where the server would be responsible for making sure the atomic move succeeded.

It's not a difficult feature to implement (in fact, if I recall, there is a pull request open for this feature in beanstalkd) and IMHO it would open up a lot of interesting use cases.
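
Server-side the atomicity is cheap; a hedged sketch (not beanstalkd's actual code) where one lock covers both tubes, so the job can never be observed in both tubes or in neither:

    package main

    import "sync"

    type Job struct{ ID string }

    type Server struct {
        mu    sync.Mutex
        tubes map[string][]Job
    }

    // Move atomically transfers a job between tubes: under one lock the
    // job is removed from `from` and appended to `to`, so a crash or a
    // timeout can never leave it duplicated or lost.
    func (s *Server) Move(id, from, to string) bool {
        s.mu.Lock()
        defer s.mu.Unlock()
        src := s.tubes[from]
        for i, j := range src {
            if j.ID == id {
                s.tubes[from] = append(src[:i], src[i+1:]...)
                s.tubes[to] = append(s.tubes[to], j)
                return true
            }
        }
        return false // job not found in the source tube
    }

    func main() {}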


Is there a reason why there aren't 2 types of jobs, one for each stage? You mentioned that in the first stage some work is performed, and then it sounds like you need to pass the work on to another worker for the second stage.


May I suggest NATS queueing as an alternative? [1]

[1] http://nats.io/documentation/tutorials/nats-queueing/


We evaluated NATS some years ago, but the lack of delayed messages stopped us from replacing our beloved beanstalkd.

Looked very good though


Very nice work, I love the simplicity. With a reliable persistence layer, this could be a viable option for a production job queue.

What are the plans for persistence? Persist to disk? Or pluggable storage backends? Disk, Redis, and SQL options would be cool!


Thanks. Simplicity was the main goal, and I took many passes eliminating cruft. The initial plan for persistence is to disk, in the form of a command log, similar to VoltDB's Command Log or Redis' AOF. A simple approach to durability.
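
A command log can be quite small. Here is a hedged Go sketch of the append-then-replay idea (assumptions only, not Workq's actual persistence code):

    package main

    import (
        "bufio"
        "os"
    )

    // CommandLog appends every mutating command to a file so that the
    // in-memory state can be rebuilt by replaying the file on startup.
    type CommandLog struct {
        f *os.File
        w *bufio.Writer
    }

    func Open(path string) (*CommandLog, error) {
        f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
        if err != nil {
            return nil, err
        }
        return &CommandLog{f: f, w: bufio.NewWriter(f)}, nil
    }

    // Append durably records one command before it is acknowledged.
    func (l *CommandLog) Append(cmd string) error {
        if _, err := l.w.WriteString(cmd + "\n"); err != nil {
            return err
        }
        if err := l.w.Flush(); err != nil {
            return err
        }
        return l.f.Sync() // fsync per command; batching trades safety for speed
    }

    // Replay feeds every logged command back through apply at startup.
    func Replay(path string, apply func(cmd string)) error {
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        s := bufio.NewScanner(f)
        for s.Scan() {
            apply(s.Text())
        }
        return s.Err()
    }

    func main() {}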

There will be some sort of interface for the storage, and I'll keep pluggability in mind; it is something Gearman had back in the day as well [0]. Most likely it will be persistence to disk for some time, and once the requirements become clearer, possibly pluggable backends.

[0] http://gearman.org/manual/job_server/


If I understand correctly, a job is a message you send such that different workers can pick it up. Correct me if I am wrong. What's the difference between something like RabbitMQ and this? Genuinely curious.


RabbitMQ can do much of what Workq can do from a purely messaging standpoint, since the input and output look about the same.

Workq is built on the higher-level concept of a job, so the feature set is refined around what a job is. In Workq, a job must successfully complete or fail, optionally with a result passed back to the client. A job can be retried when it has timed out, or even when it has explicitly failed outright (maybe there was a temporary error with your API provider, etc.). You can say: retry the job if it has timed out, up to 5x, BUT let it explicitly fail only once. These small refinements help streamline the concept of processing a job fully, not just passing around an opaque message blob.
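
For illustration, an enqueue with those limits could look something like this on the wire (a sketch only; the exact flag names and syntax here are assumptions, see the README for the real add command):

    add 6ba7b810-9dad-11d1-80b4-00c04fd430c8 ping 5000 60000 4 -max-attempts=5 -max-fails=1
    ping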

Also, there are some other key things, such as job scheduling based on time, which doesn't exist in RabbitMQ but is usually offered in libraries such as DelayedJob, etc.


Just curious what advantages a TCP text interface has over HTTP with JSON which would typically be my default.


The text interface primarily has simplicity on its side. The goal was to implement a set of commands with a small footprint. HTTP would have provided "too much" for me to worry about in terms of designing the commands. HTTP2 was considered at one point, but it also proved to be "too much" at the time.

From my own experience, the text commands are easier to test against, especially at the boundaries (inputs to the server), which makes client development significantly easier.


I can see how designing the command language can be simpler and cleaner in text than say XML. HTTP offers so much tooling, I might have gone with text over HTTP or JSON with a `command` entry.

As for HTTP2, isn't that handled by the HTTP server/client implementations, and from the application code's perspective the same as HTTP?


I can definitely agree on the HTTP tooling!

As for the HTTP2 portion, the server details would be abstracted out especially with HTTP2 support in Go 1.6. At the time I looked, HTTP2 clients for various languages were still popping up and stabilizing (I think they still are). I didn't want that to be a factor when I was developing clients outside of Go (for example PHP). In addition, an important goal was to develop extremely small clients, where I understood exactly what was going over the wire.

If there are enough direct tooling benefits that HTTP2 can offer, it would be fun to experiment with it as an alternative interface. Funny enough, the first prototype name of the project was "httpq".


One advantage is that it is easier to have a single persistent connection that the client can interleave multiple requests on. That way you don't pay the price of constantly opening and closing TCP sockets: just open one once and use it until it dies, then open a new one. There are many reasons a socket can die, so your client does have to deal with that situation.
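
That client pattern is simple enough to sketch in Go (a generic illustration, not any particular Workq client):

    package main

    import (
        "bufio"
        "fmt"
        "net"
    )

    // Conn wraps one long-lived TCP connection to the server.
    type Conn struct {
        addr string
        c    net.Conn
        r    *bufio.Reader
    }

    func Dial(addr string) (*Conn, error) {
        c, err := net.Dial("tcp", addr)
        if err != nil {
            return nil, err
        }
        return &Conn{addr: addr, c: c, r: bufio.NewReader(c)}, nil
    }

    // Do sends one command and reads one reply line over the same
    // socket, redialing once if the connection has died in between.
    func (q *Conn) Do(cmd string) (string, error) {
        for attempt := 0; attempt < 2; attempt++ {
            if _, err := fmt.Fprintf(q.c, "%s\r\n", cmd); err == nil {
                if line, err := q.r.ReadString('\n'); err == nil {
                    return line, nil
                }
            }
            // The socket died: replace it with a fresh connection.
            q.c.Close()
            nc, err := Dial(q.addr)
            if err != nil {
                return "", err
            }
            *q = *nc
        }
        return "", fmt.Errorf("connection to %s keeps failing", q.addr)
    }

    func main() {}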


I quite like text-based protocols like this, beanstalkd, and memcache: each interaction is stateless, and writing clients is trivial.

That's not to say an HTTP interface is that much more difficult to add these days.


> In-memory only for now, disk backed durability is on the roadmap.

> Job payload & results are limited to 1 MiB each.

> Workq servers are standalone and do not speak to each other.

i.e. don't use it in production. This is a nice proof of concept, but let's not pretend that it is a professional grade product at the moment.


I'm not the author, but I don't think anyone here pretended that it's a "professional grade product". From the actual README: "Workq is in alpha status and not yet stable."


Sure, but it looks like he intends it to eventually be a professional grade product. Disk-backed durability can be added, but adding distributed/failover capability should be designed in from the beginning.

And nobody's ever going to trust your distributed capability if you implement Raft yourself. You need to build upon something trusted, or convince Aphyr to run Jepsen on your implementation. etcd is a common thing to build upon, and even it is only partially trusted.


> And nobody's ever going to trust your ...

I guess that could be said for a number of things that people successfully do?

Edit: for one great example of someone who didn't get discouraged by naysayers, check out caddyserver.


Great example. Nobody's going to use caddyserver in production for a major site for a couple years for that reason. Right now it's widely used on 'hobby' sites. A couple of years of good track record on hobby sites will let it be trusted enough to run on more mission critical sites.

And caddyserver isn't a distributed service, so the level of trust required is much lower.

Distributed services are difficult to get right for a wide variety of reasons, as shown by Aphyr's Jepsen tests.


It is, however, already starting to get developer traction and mindshare, and even a few donations if I am representative of the userbase; something it would never get now if it sat waiting for the perfect time to release a perfect product.

Also, I think the README of workq didn't mention anything about Raft, which is fine; you can go very far without it.

My point is: encourage people to write code! Don't infect people with paralysis-by-analysis.


Not everyone needs a distributed solution. Sometimes "beefing up the box" is enough; not everyone needs Facebook-grade scaling tools.

As they say, scaling is a nice problem to have.


True, I don't need scaling. I do need robustness to hardware failure, though.


This is the same reason why Workq is intended to be standalone. It takes a great deal of expertise and time to implement this in a distributed fashion.

I spent 80% of my time on Workq just writing tests for it, and there is still so much to account for, even in a standalone system.

On the bright side, projects like etcd & consul (I believe the author of Raft is helping out there) are getting better and better and can be embedded.


Cool. If you think you can embed etcd or consul that'd be awesome. Please post again if/when that happens!


It seems to offer the same feature set as beanstalkd.

We have been using beanstalkd for years to process gazillions of jobs without an issue and I guess many readers are in the same position.

Why would I choose Workq over beanstalkd?


Workq is similar to beanstalkd and was modeled after many of its concepts, especially TTR and reserve (https://github.com/iamduo/workq#credits).

The one feature which may not be obvious yet is the ability for workers to mark a job successfully completed or failed with a result, which the client can then retrieve later. The workflow looks like this:

* Client A: Backgrounds Job A

* Client A: Backgrounds Job B

* Client A: Backgrounds Job C

----

* Worker A: Picks up Job A + Completes

* Worker B: Picks up Job B + Completes

* Worker C: Picks up Job C + Completes

----

* Client A: Picks up the results for Jobs A, B, and C.

This allows a single client to concurrently process multiple jobs within a single process and retrieve their results. This is what I like to call "Gearman mode" [0], as it was modeled after that project as well. It is useful in languages that do not have well-defined concurrency. This is a niche use case and may not be needed by everyone, but it is very useful as soon as you need it. This will become more obvious when I have clients for those languages.
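
Concretely, the flow above might look like this (a hedged sketch; Submit and Result are hypothetical helpers, not a real Workq client API):

    package main

    import (
        "fmt"
        "time"
    )

    // Submit and Result are hypothetical stand-ins for a real Workq
    // client; they are not the actual API.
    func Submit(name string, payload []byte) (string, error) {
        return name + "-id", nil // would issue an async "add" command
    }

    func Result(id string, timeout time.Duration) ([]byte, error) {
        return []byte("done"), nil // would issue "result" with a wait-timeout
    }

    func main() {
        var ids []string
        for _, name := range []string{"job-a", "job-b", "job-c"} {
            id, _ := Submit(name, []byte("payload"))
            ids = append(ids, id) // all three now run concurrently on workers
        }
        for _, id := range ids {
            res, err := Result(id, 60*time.Second)
            fmt.Println(id, string(res), err)
        }
    }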

Lastly, there are some subtle enhancements such as retry support and synchronous processing (submit and wait for the result).

Thanks for the great question. This is a very popular question and I will FAQ it.

[0] http://gearman.org


Thanks! I'm currently working on a hobby project and skipped the job queue part because I thought it would need some more consideration :) Now it looks like this could fill that gap. Regarding persistence, how about adding https://github.com/docker/libkv? For a single node, the BoltDB backend is more than enough, and once you want to go distributed, just switch to Consul/Etcd.


Related: a Go message queue at http://nsq.io/


Why did you decide to build the queuing/messaging part yourself? There are plenty of solid options out there (e.g. RabbitMQ).


Hmm... are we at the point now where someone is going to reimplement the Redis server in Go?



Can it stream data from jobs? For example, to monitor progress?


There is no individual progress streaming at the moment. It was on the drawing board, but was slashed for simplicity and time. Just curious, what would you use it for in your case?


I would use it to show the progress to a user :)

For example: 68% done

Or: ETA: 15 minutes


I like it


How does this differ from Mesos or Docker Swarm?


It's a job queue with server and clients.



