Write your own task queue (danpalmer.me)
84 points by danpalmer on Sept 12, 2022 | 37 comments



> A recent example is WakaTime, who replaced Celery with a custom-built queue. This effort took one week to build and productionise, and consisted of just 1,264 lines of Python.

What a load of BS. WakaTime doesn't even have tests, has been used by one person, and has no error handling. It's 1k lines because it's alpha software.

Celery is not the most fun to play with, but it's not big for nothing. It supports many routers AND result stores, and you can mix and match. It has several policies regarding errors, has a beat daemon, provides an API for monitoring and managing (see https://flower.readthedocs.io/en/latest/ for a UI), etc.

Also, you can use celery in a very bare-bones setup using only the FS for routing and storage: https://www.distributedpython.com/2018/07/03/simple-celery-s...

I understand you may want to avoid celery. I had bad experiences with it myself. Most projects can get away with rq, or even just multiprocessing.Pool.
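If "task queue" really just means "run these jobs in parallel", the stdlib alone can do it. A minimal sketch (no persistence, retries, or scheduling):

  from multiprocessing import Pool

  def send_email(address):
      ...  # the actual work goes here

  if __name__ == "__main__":
      # Four worker processes chew through the job list in parallel.
      with Pool(processes=4) as pool:
          pool.map(send_email, ["a@example.com", "b@example.com"])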

> Before embarking on the mission of creating a new task queue do survey the existing options, but make sure not to underestimate the hidden costs of using one, or the benefits that may come with writing one from scratch.

Writing your own task queue is like writing your own CSV parser. It seems like a very simple task, barely a for loop with a split. And then you start hitting edge case after edge case.
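(The analogy holds: the "for loop with a split" falls over on the very first quoted field.)

  line = 'a,"hello, world",b'

  print(line.split(","))           # ['a', '"hello', ' world"', 'b'] -- wrong
  import csv
  print(next(csv.reader([line])))  # ['a', 'hello, world', 'b'] -- correct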

Basically you are coding an asynchronous, fault-tolerant process manager, including task serialization, a communication protocol, priority queuing, and lifecycle management.

That's not something you usually want to do. The cost of using one is unlikely to be higher than the cost of creating, maintaining, and documenting your own. Just go with a simple existing task queue if you need something simple. PyPI is full of them.


The task queue we built at Thread wasn’t much bigger, actually supported multiple queue backends, had tests, and has run millions of jobs a day for a decade. The team knows it very well, and it takes a few days a year to maintain, improve, and iterate to fit with new infrastructure.


Agreed on Celery - I had to dig around in that source a few times to build extensions or investigate bugs, and hooooly cow. That is some of my least-favorite Python code ever... for a task queue. It's extremely fancy[1], probably in part because it does so much to make it "fluent"-feeling (which is honestly pretty neat).

As to building your own queue: if you can tolerate small loss on outages, redis + optionally a persisting replica makes things truly trivial. You can build a safe queue in a day with a small "pop into in-progress list" pipeline/script and a monitoring daemon to detect lost tasks. Monitoring daemons and dumb workers are super simple.
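The core of that "pop into in-progress list" pattern really is tiny. A sketch assuming redis-py, with illustrative key names:

  import redis

  r = redis.Redis()

  def pop_task(timeout=5):
      # Atomically move the next task onto an in-progress list, so a
      # crashed worker leaves evidence behind for the monitoring daemon.
      return r.brpoplpush("q:pending", "q:in_progress", timeout=timeout)

  def ack(raw_task):
      # Work finished: drop the task from the in-progress list. The
      # monitor requeues anything that lingers here too long.
      r.lrem("q:in_progress", 1, raw_task)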

[1] Or at least it was, a few years ago. It felt like reading an exploration of every single metaprogramming feature that Python offers, all interacting with each other.


They do some very strange stuff in the Celery codebase, like most of the classes inheriting from dict, and splitting things into multiple packages and repos. Like, is anyone actually going to reuse kombu (Celery's messaging package)?


I get the arguments, I think, I just... don't want to be spending any percentage of my time supporting a task queue I've written. I don't want to respond to feature requests or bug reports or questions. I want to use something that already exists and let the people who developed it respond to the feature requests, bug reports, and questions.

That said, sometimes a task queue isn't the right tool for a job. At my last company, someone built something you could call a task queue on top of AWS Step Functions. What she needed didn't exist, so she built it. Of course, now she has to respond to feature requests, bug reports, and questions about it.


I don't see how using something that exists gets you out of that problem?


There’s very little maintenance to do when you build what you need.

Open source task queues try to solve everyone’s problems and as a result need a lot of work. By only solving your problems you may be surprised at how little code there actually is if you build on top of something like Postgres or Redis.


Though it obviously depends on the case at hand, I sort of agree with this.

For a distributed build cluster that I maintain (Buildbarn, https://github.com/buildbarn/bb-remote-execution/), I also had to implement a scheduler process that would queue compilation/test actions, so that they can be executed on workers later on.

Initially I looked into using some conventional queueing system, but eventually settled on implementing my own as part of the scheduler process. So far I'm really happy with this choice, as it has allowed me to implement the following features, and more:

- In-flight deduplication of identical compilation actions. If identical actions are scheduled with different priorities, the highest priority is used (sketched at the end of this comment).

- Multi-level scheduling fairness between groups, users in a group, builds run by the same user, etc. The fairness cooperates well with priorities.

- Automatic removal of queued actions that are no longer associated with any running build. When the action was in-flight deduplicated, the priority may need to be lowered again.

- Stickiness, where workers prefer picking up actions that are similar to the one they ran previously, for reducing network utilisation.

- Facilities for draining workers.

Though I'm not saying it would have been impossible to achieve this with an off-the-shelf task queue, I'm not convinced it would have been easy. Adding a new feature right now only means I need to care about its actual semantics, as opposed to trying to figure out how to map it onto the feature set of the queueing system of choice.
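To make the first point concrete, a minimal sketch of in-flight deduplication with priority merging (Python purely for illustration; this is not the actual Buildbarn code, which is Go):

  import heapq

  class DedupingQueue:
      # Identical actions merge by digest; re-queuing keeps the highest
      # priority (lower number = higher priority here).
      def __init__(self):
          self.best = {}   # digest -> best priority seen
          self.heap = []   # (priority, digest); may hold stale entries

      def push(self, digest, priority):
          if priority < self.best.get(digest, float("inf")):
              self.best[digest] = priority
              heapq.heappush(self.heap, (priority, digest))

      def pop(self):
          while self.heap:
              priority, digest = heapq.heappop(self.heap)
              if self.best.get(digest) == priority:
                  del self.best[digest]
                  return digest
              # Stale entry from a lower-priority duplicate; skip it.
          return None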


I continue to prefer building a custom queue for anything requiring features like the ones you describe. I've tried using off-the-shelf queuing systems, and having the full state of a task spread across systems (in other words, having to consult multiple systems to decide whether work should be done) produces such a complex, buggy mess.

Like anything else, when you need to pierce the abstraction, then it no longer serves you in its current form.


It is absolutely a good idea to write your own task queue. I agree 100%. Most of the time you have a very specific need, and you'll have to do weird stuff (or use a library and/or framework that isn't super popular) to support those features.

My example is that I needed a task queue that did some basic rate limiting, but across workers. I wanted to be able to have 100 workers and still not hit an endpoint more than (roughly) n times a second (or minute, or hour, or day, etc.).

There are systems that do this, but they are either complex themselves or require complex backends and/or software installs (other platforms, etc.). Maybe they don't run in your desired environment, or maybe they do, but adding dependencies should not be done lightly.

Creating my own task queue took a few days and then a few more days of addressing what were for us very minor bugs (duplicate execution of jobs was one).


How did you end up solving the rate limiting? Were the workers themselves responsible for coordination via a semaphore (e.g. there's already 25 of us, don't consume from this queue) or did you solve it on the scheduling side and only push jobs into their queues once you knew there was capacity?


Scheduling side: we only schedule jobs if there have not been N of that type scheduled during that period. Works pretty well.

The jobs are still technically "queued" but they don't get consumed until something opens up (but workers can work on other tasks).
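A fixed-window counter in a shared store is one simple way to get this behaviour; a sketch assuming redis-py, with illustrative key names:

  import time
  import redis

  r = redis.Redis()

  def may_schedule(job_type, limit, window_secs):
      # A counter shared by all dispatchers, so 100 workers still hit
      # an endpoint at most ~limit times per window.
      key = f"rate:{job_type}:{int(time.time() // window_secs)}"
      count = r.incr(key)
      if count == 1:
          r.expire(key, window_secs)  # old windows clean themselves up
      return count <= limit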


That's cool. I've been thinking through a similar design. Would this be an accurate description of your approach?

Queues hold jobs; the scheduler/dispatcher handles load-balancing and rate-limiting, pushing jobs down to workers only when flow-control criteria are met?

  ┌────────┐                 ┌─────────┐
  │ queueA ├──┐           ┌──┤worker1  │
  └────────┘  │           │  └─────────┘
              │           │
  ┌────────┐  ├───────────┴┐ ┌─────────┐
  │ queueB ├──┤dispatcher  ├─┤worker2  │
  └────────┘  ├───────────┬┘ └─────────┘
              │           │
  ┌────────┐  │           │  ┌─────────┐
  │ queueC ├──┘           └──┤workerN  │
  └────────┘                 └─────────┘


Yeah, pretty much! Technically I have one real queue but different "tags" to route things (so we can rate limit, etc.). But basically the same thing.

My system supports as many dispatchers as you want, which is a good addition but makes the logic more complex, as you have to be careful with locks so you don't schedule jobs multiple times, for example.

My system also implements retry logic, and a bunch of other stuff, but that isn't absolutely required either.
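The lock for the multi-dispatcher case can be as simple as a single atomic SET with an expiry; a sketch assuming redis-py, illustrative names only (not the actual implementation described above):

  import uuid
  import redis

  r = redis.Redis()

  def try_claim(job_id, ttl=30):
      # Only one dispatcher wins the lock, and it expires on its own
      # if that dispatcher dies mid-schedule.
      token = str(uuid.uuid4())
      return r.set(f"dispatch-lock:{job_id}", token, nx=True, ex=ttl)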


Sorry, how did you draw this?



Thank you.


Years ago I wrote a language-agnostic task queue in PHP.

Now, with Elixir GenServers that can handle proper concurrency and back pressure, it's trivial.

One of the best is Oban, which uses PostgreSQL for queue storage. No more Celery, Redis, or other external dependencies needed.

Heck, if you don't need persistence you can use Que, which keeps everything in RAM.

That being said I would never write a task queue any longer. Too many good options out there now for every major language.

We’re not Google.


My first thought was "build your own [X] and never use it", which I've always found to be a good exercise until you learn to navigate large codebases and systems; after that you can learn from the existing rather than making your own mistakes.


Eh…I think there are 2 real options here.

1. Queue jobs in a database table so you get transaction guarantees (see the sketch below).

2. Use a prebuilt system that works with any number of existing alternatives (or one that works with the database too).

I can’t imagine building a queue that meets any slice of my requirements, and being able to trust it, without heavily leveraging SQL.
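The appeal of option 1 in a nutshell: the job row commits or rolls back together with the business write. A sketch assuming psycopg and made-up table names:

  import json
  import psycopg

  order_id = 42  # example

  with psycopg.connect("dbname=app") as conn:
      with conn.transaction():
          # If either statement fails, neither the order update nor
          # the job exists: no orphaned and no phantom work.
          conn.execute("UPDATE orders SET paid = true WHERE id = %s", (order_id,))
          conn.execute(
              "INSERT INTO jobs (kind, payload) VALUES (%s, %s)",
              ("send_receipt", json.dumps({"order_id": order_id})),
          )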


Interesting, I took this to heart a couple of years ago and came up with something with a focus on simplicity to solve our niche issue at $fintech.

https://github.com/tomarrell/miniqueue


Word of warning - if you write your own task queue at a startup, you will spend the rest of your tenure justifying this decision to every new data engineer who joins.

Also, am I crazy, or do the Celery docs not even clarify their delivery semantics? Isn't that table stakes for a queueing system? As best I could tell, you can get close to "at least once" with

  acks_late=True, task_reject_on_worker_lost=True
but not in cases like a worker hanging indefinitely without being explicitly killed.


My opinion on this is that if you need at-least-once semantics, it is best to record the task in a store you can trust, take the task's ID, and then schedule a task with that data.

Then, add a "sweeper" that checks whether the task actually happened, and requeues the job if it did not.

To ensure the same job is not worked on twice, add some locking, and you have a really stable way of scheduling jobs.
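A sketch of the sweeper half, assuming the trusted store is SQL accessed via psycopg; table names are illustrative and enqueue() stands in for whatever scheduling call you use:

  STALE = """
      SELECT id FROM tasks
      WHERE status != 'done'
        AND created_at < now() - interval '10 minutes'
  """

  def sweep(conn, enqueue):
      # Anything recorded but never completed gets scheduled again;
      # workers re-read the task by ID, and the locking mentioned
      # above keeps a duplicate run from doing the work twice.
      for (task_id,) in conn.execute(STALE):
          enqueue(task_id)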

Could this have been done INSIDE celery? Sure, but I would say that this is not the case that it should optimize for.


Every version of this I experienced went like this:

> We have a task queue…

> Is it Celery?

> No

> oh thank god. How do I use it?

> Put the @task decorator on a function and call it. See the args to that decorator if you want slightly more control.
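A hedged guess at what such an interface might look like (the real implementation wasn't shown; enqueue() is a hypothetical stand-in for the backend, and arguments are assumed JSON-serializable):

  import functools
  import json

  def enqueue(queue, payload, max_retries):
      ...  # hand off to the actual queue backend

  def task(queue="default", max_retries=0):
      def decorator(fn):
          @functools.wraps(fn)
          def wrapper(*args, **kwargs):
              # Calling the function enqueues it instead of running inline.
              enqueue(queue, json.dumps(
                  {"fn": fn.__name__, "args": args, "kwargs": kwargs}
              ), max_retries)
          return wrapper
      return decorator

  @task(queue="emails", max_retries=3)
  def send_welcome(user_id):
      ...  # runs later, on a worker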


We use Celery with Airflow. We recently had an incident where the Celery workers were getting thrashed because, by default, the workers all heartbeat to each other via Redis.

The networking on our Redis instance was maxed out all the time, and we had a secret soft cap on the number of workers that could be active.

The documentation even says that as-is this "gossiping" doesn't even do anything!
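(If anyone hits the same thing: Celery's worker has documented flags to turn those channels off; worth checking the behaviour against your Celery version.)

  celery -A proj worker --without-gossip --without-mingle --without-heartbeat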

Unsure if it's worth it for us to write our own, but I felt inspired to share my Celery horror-story.


Celery has so many issues like this. Default timeouts, all kinds of stupid problems. It isn't even really cross-platform! What a joke.


The part about building on top of Postgres… the author means something similar to pub/sub, not abusing a table to sort tasks by create_date, correct?

Because I’ve built my own “task queue” on top of “SELECT * FROM tasks WHERE done = 0 ORDER BY create_date LIMIT 1;” and have also seen others do it. Just don’t do it like this: you’ll get spikes of tasks, and the table will fill up faster than your worker can finish them. Just use something else.


Yeah, so we didn’t do this; we used Redis and never really had any performance issues at that layer. You’re right that it’s A Bit More Complicated Than This to use Postgres, but it’s still tractable within a few days for someone who knows Postgres (more than just basic SQL). Maybe a little longer if you need high throughput.
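For reference, the "more than just basic SQL" part is mostly one well-known pattern: claiming a row with FOR UPDATE SKIP LOCKED so concurrent workers don't fight over the same task. A sketch assuming psycopg and an illustrative schema:

  import psycopg

  CLAIM = """
      UPDATE tasks SET status = 'running'
      WHERE id = (
          SELECT id FROM tasks
          WHERE status = 'queued'
          ORDER BY create_date
          FOR UPDATE SKIP LOCKED  -- skip rows other workers have claimed
          LIMIT 1
      )
      RETURNING id, payload
  """

  with psycopg.connect("dbname=app") as conn:
      row = conn.execute(CLAIM).fetchone()  # None means the queue is empty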


This article and the comments make me feel better. I wanted a job scheduler (not so different to a job queue, I guess). I'd heard bad things about Luigi and good things about Airflow, but after spending half a day trying to grok it, I was terrified. I think I gave up when I had to choose between Postgres and MySQL to create user accounts, just to get started.

I wrote my own in half a day, then spent a few days total tinkering with it too, so far so good.


I've found this is a very useful exercise when your requirements are much simpler than [popular library], or you need a specific feature that it can't easily be patched to support.

Write your own, and even if you give up halfway through and use the popular library after all, you'll have learned a lot about how stuff works, and can make better decisions about how to use the library.


I wrote my own in-process task queue in Go, tailored to my needs: I needed auto-scaling and the ability to connect different pools into a pipeline. https://github.com/cmitsakis/workerpool-go


The only task queue I loved was beanstalkd -- it's beautifully written and highly performant. Starting it takes seconds and it's been running for a decade:

https://beanstalkd.github.io/


The only downside for me is that it uses only one core, making it hard to scale vertically.


Tiny bug report: the <title> element on that page just says "Dan Palmer"; it would be better if it held the title of the blog post.


Ah crap. Thanks so much for pointing that out, I’ll fix it soon!


Links are broken.


Links in the blog are 404ing.



