Show HN: Hatchet – Open-source distributed task queue (github.com/hatchet-dev)
578 points by abelanger 10 months ago | 189 comments
Hello HN, we're Gabe and Alexander from Hatchet (https://hatchet.run), and we're working on an open-source, distributed task queue. It's an alternative to tools like Celery for Python and BullMQ for Node.js, primarily focused on reliability and observability. It uses Postgres for the underlying queue.

Why build another managed queue? We wanted to build something with the benefits of full transactional enqueueing - particularly for dependent, DAG-style execution - and felt strongly that Postgres solves for 99.9% of queueing use-cases better than most alternatives (Celery uses Redis or RabbitMQ as a broker, BullMQ uses Redis). Since the introduction of SKIP LOCKED and the milestones of recent PG releases (like active-active replication), it's becoming more feasible to horizontally scale Postgres across multiple regions and vertically scale to 10k TPS or more. Many queues (like BullMQ) are built on Redis, where data loss can occur on OOM if you're not careful; using PG helps avoid an entire class of problems.
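To illustrate the pattern we mean (a sketch of the general SKIP LOCKED approach with psycopg - the table and columns here are made up for illustration, not Hatchet's actual schema):

    import psycopg  # psycopg 3; table/column names below are illustrative

    conn = psycopg.connect("postgresql://localhost/app")

    # Enqueue transactionally: the job row commits (or rolls back) together
    # with any other writes made in the same transaction.
    def enqueue(name: str, payload: str) -> None:
        with conn.transaction():
            conn.execute(
                "INSERT INTO tasks (name, payload, status) VALUES (%s, %s, 'queued')",
                (name, payload),
            )

    # Dequeue with FOR UPDATE SKIP LOCKED: concurrent workers grab different
    # rows instead of blocking on each other's row locks.
    def dequeue_one():
        with conn.transaction():
            row = conn.execute(
                """
                SELECT id, name, payload FROM tasks
                WHERE status = 'queued'
                ORDER BY id
                LIMIT 1
                FOR UPDATE SKIP LOCKED
                """
            ).fetchone()
            if row is None:
                return None
            conn.execute("UPDATE tasks SET status = 'running' WHERE id = %s", (row[0],))
            return row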

We also wanted something that was significantly easier to use and debug for application developers. A lot of times the burden of building task observability falls on the infra/platform team (for example, asking the infra team to build a Grafana view for their tasks based on exported prom metrics). We're building this type of observability directly into Hatchet.

What do we mean by "distributed"? You can run workers (the instances which run tasks) across multiple VMs, clusters and regions - they are remotely invoked via a long-lived gRPC connection with the Hatchet queue. We've attempted to optimize our latency to get our task start times down to 25-50ms and much more optimization is on the roadmap.

We also support a number of extra features that you'd expect, like retries, timeouts, cron schedules, and dependent tasks. A few things we're currently working on: we use RabbitMQ (confusing, yes) for pub/sub between engine components and would prefer to just use Postgres, but didn't want to spend additional time on the exchange logic until we built a stable underlying queue. We are also considering the use of NATS for engine-engine and engine-worker connections.

We'd greatly appreciate any feedback you have and hope you get the chance to try out Hatchet.




I love your vision and am excited to see the execution! I've been looking for exactly this product (postgres-backed task queue with workers in multiple languages and decent built-in observability) for like... 3 years. Every 6 months I'll check in and see if someone has built it yet, evaluate the alternatives, and come away disappointed.

One important feature request that probably would block our adoption: one reason why I prefer a postgres-backed queue over eg. Redis is just to simplify our infra by having fewer servers and technologies in the stack. Adding in RabbitMQ is definitely an extra dependency I'd really like to avoid.

(Currently we've settled on graphile-worker which is fine for what it does, but leaves a lot of boxes unchecked.)


Funny how this is vision now. I started my career 29 years ago at a company that built exactly this, but based on Oracle. The agents would run on Solaris, AIX, VAX/VMS, HP-UX, Windows NT, IRIX, etc. It was also used to create an automated CI/CD pipeline to build all binaries on all these different systems.


Because people don’t know what they don’t know, and, learning from others (along with human knowledge sharing and transfer) doesn’t seem to be what society often prioritizes in general.

Not so much talking about the original post, I think it’s awesome what they are building, and clearly they have learned by observing other things.


This has also basically existed for years as an open-source, drop-in library (no sidecar dependencies outside of Postgres) in Elixir called Oban; the Pro version adds a web dashboard and a zoo of complex task types.


Yep, it feels like half the Show HN launches are for infrastructure tooling that already exists natively or as plug-and-play libraries for Elixir/Erlang.

I really try to suggest people skip Node and learn a proper backend language with a solid framework with a proven architecture.


Oban looks great - how would one run a Python CUDA-based workload on it?


You could shell out with Porcelain, make the Python a long-running process and use ports, or port your Python code to Nx.


Thank you, appreciate the kind words! What boxes are you looking to check?

Yes, I'm not a fan of the RabbitMQ dependency either - see here for the reasoning: https://news.ycombinator.com/item?id=39643940.

It would take some work to replace this with listen/notify in Postgres, less work to replace this with an in-memory component, but we can't provide the same guarantees in that case.


I come to this only as an interested observer, but my experience with listen/notify is that it outperforms rabbitmq/kafka in small to medium operations and has always pleasantly surprised me. You might find out it's a little easier than you think to slim your dependency stack down.


How do you handle things when no listeners are available to be notified?


Presumably there'd be a messages table that you listen/notify on, and you'd replay messages that weren't consumed when a listener rejoins. But yeah, this is the overhead I was referencing.


Yep, but practically speaking, you need those records anyway even if you're using another queue to actually distribute the jobs. At least every system I've ever built of a reasonable size has a job audit table anyway. Plus it's an "Enterprise Feature™" so you can differentiate on it if you like that kind of feature-based pricing


Postgres's LISTEN/NOTIFY doesn't keep those kinds of records. The whole point of using SKIP LOCKED is that you can update rows to track those messages while supporting concurrent consumers.


Yes. I'm saying you'll manually need to insert some kind of job audit log into a different table. Cheers


With the way LISTEN/NOTIFY works, Postgres doesn't keep a record of messages that are not delivered, so you cannot replay them - unless you know something about PostgreSQL that I don't.


You insert work-to-be-performed into a table, and use NOTIFY only to wake up consumers that there is more work to be had. Consumers that weren't there at the time of NOTIFY can look at the rows in the table at startup.
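A minimal sketch of that pattern with psycopg (table and channel names are illustrative; the table stays the source of truth and NOTIFY is only a wake-up signal):

    import psycopg  # psycopg 3

    def enqueue(conn: psycopg.Connection, payload: str) -> None:
        with conn.transaction():
            conn.execute(
                "INSERT INTO tasks (payload, status) VALUES (%s, 'queued')", (payload,)
            )
            conn.execute("NOTIFY task_queued")  # delivered to listeners on commit

    def worker(conninfo: str) -> None:
        conn = psycopg.connect(conninfo, autocommit=True)
        conn.execute("LISTEN task_queued")
        drain(conn)  # catch up on rows inserted while nobody was listening
        for _ in conn.notifies():  # blocks until a notification arrives
            drain(conn)

    def drain(conn: psycopg.Connection) -> None:
        while True:
            row = conn.execute(
                "UPDATE tasks SET status = 'running' "
                "WHERE id = (SELECT id FROM tasks WHERE status = 'queued' "
                "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED) "
                "RETURNING id, payload"
            ).fetchone()
            if row is None:
                break
            print("processing", row)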


I see. So the notify is just to say there is work to be performed, but there is no payload that includes the job. The consumer still has to make a query; if there isn't enough work, the queries come back empty. This saves you from having to poll, but it's not a true push system.


as far as I can tell NOTIFY is fanout, in the sense that it will send a message to all the LISTENing connections, so it wouldn't make sense in that context anyway. It's not one-to-one, it's about making sure that jobs get picked up in a timely fashion. If you're doing something fancier with event sourcing or equivalent, you can send events via NOTIFY, and have clients decide what to do with those events then.

Quoth the manual: "The NOTIFY command sends a notification event together with an optional “payload” string to each client application that has previously executed LISTEN channel for the specified channel name in the current database. Notifications are visible to all users."


Notify can be triggered with stored procedures to send payloads related to changes to a table. It can be set up to send the id of a row that was inserted or updated, for example. (But WAL replication is usually better for this)
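For example, a sketch of such a trigger (names are illustrative; run once, e.g. as part of a migration):

    import psycopg  # psycopg 3

    create_fn = """
    CREATE OR REPLACE FUNCTION notify_task_inserted() RETURNS trigger AS $$
    BEGIN
        -- send the new row's id on the 'task_queued' channel
        PERFORM pg_notify('task_queued', NEW.id::text);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;
    """

    create_trigger = """
    CREATE TRIGGER tasks_notify
    AFTER INSERT ON tasks
    FOR EACH ROW EXECUTE FUNCTION notify_task_inserted();
    """

    with psycopg.connect("postgresql://localhost/app") as conn:
        conn.execute(create_fn)
        conn.execute(create_trigger)  # committed when the block exits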


Broadcasting the id to a lot of workers is not useful, only one of them should work on the task. Waking up the workers to do a SELECT FOR UPDATE .. SKIP LOCKED is the trick. At best the NOTIFY payload could include the kind of worker that should wake up.


Boxes-wise, I'd like a management interface at least as good as the one Sidekiq had in Rails for years. Would also need some hard numbers around performance and probably a bit more battle-testing before using this in our current product.


You can do a fair amount of this with Postgres using locks out of the box. It’s not super intuitive but I’ve been using just Postgres and locks in production for many years for large task distribution across independent nodes.
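One way to sketch this with session-level advisory locks keyed on a task id (illustrative only, not my exact setup):

    import psycopg  # psycopg 3

    def try_claim(conn: psycopg.Connection, task_id: int) -> bool:
        # pg_try_advisory_lock returns true only for the first session to grab
        # this key; other nodes see false and move on to the next task
        return conn.execute(
            "SELECT pg_try_advisory_lock(%s)", (task_id,)
        ).fetchone()[0]

    def release(conn: psycopg.Connection, task_id: int) -> None:
        # session-level locks are also released automatically on disconnect
        conn.execute("SELECT pg_advisory_unlock(%s)", (task_id,))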



Looks very similar to my solution. :-)


For what it's worth, RabbitMQ is extremely low maintenance, fire and forget. In the multiple years we've used it in production I can't remember a single time we had an issue with rabbit or that we needed to do anything after the initial set up.


Not sure if you saw it but Graphile Worker supports jobs written in arbitrary languages so long as your OS can execute them: https://worker.graphile.org/docs/tasks#loading-executable-fi...

Would be interested to know what features you feel it’s lacking.


That's interesting! Would that still involve each worker node needing to have Nodejs installed to run the process that actually reads from the queue? That's doable, but makes the deployment story a little more annoying/complicated if I want a worker that just runs Python or Rust or something.

Feature-wise, the biggest missing pieces from Graphile Worker for me are (1) a robust management web ui and (2) really strong documentation.


Yes, currently Node is the runtime but we could bundle that up into a binary blob if that would help; one thing to download rather than installing Node and all its dependencies?

A UI is a common request, something I’ve been considering investing effort into. I don’t think we’ll ever have one in the core package, but probably as a separate package/plugin (even a third party one); we’ve been thinking more about the events and APIs such a system would need and making these available, and adding a plugin system to enable tighter integration.

Could you expand on what’s missing in the documentation? That’s been a focus recently (as you may have noticed with the new expanded docusaurus site linked previously rather than just a README), but documentation can always be improved.


Hope I'm not misunderstanding, but have you checked out Gearman? While I haven't used it personally, I've used a similar thing in C#, namely Hangfire.


Windmill is built exactly like that - what box is left unchecked for it, if you had time to review it?


Note that Hatchet is MIT-licensed and Windmill is AGPL-3.0 - that's enough of a reason for many.


Why does the RabbitMQ dependency matter?

It was pretty painless for me to set up and write tests against. The operator works well and is really simple if you want to save money.

I mean, isn't Hatchet another dependency? Graphile Worker? I like all these things, but why draw the line at one thing over another over essentially aesthetics?

You better start believing in dependencies if you’re a programmer.


Introducing another piece of software instead of using one you already use anyway introduces new failures. That’s hardly aesthetics.

As a professional I’m allergic to statements like “you better start believing in X”. How can you even have objective discourse at work like that?


> Introducing another piece of software instead of using one you already use anyway introduces new failures.

Okay, but we're talking about this on a post about using another piece of software.

What is the rationale for saying this additional dependency, Hatchet, is okay, and its inevitable failures are okay, but this other dependency, RabbitMQ, which does something different but will have fewer failures for some objective reasons, is not okay?

Hatchet is very much about aesthetics. What else does Hatchet have going on? It doesn't have a lot of history, it's going to have a lot of bugs. It works as a DSL written in Python annotations, which is very much an aesthetic choice, very much something I see a bunch of AI startups doing, which I personally think is kind of dumb. Like OpenAI tools are "just" JSON schemas, they don't reinvent everything, and yet Trigger, Hatchet, Runloop, etc., they're all doing DSLs. It hews to a specific promotional playbook that is also very aesthetic. Is this not the "objective discourse at work" you are looking for?

I am not saying it is bad, I am saying that 99% of people adopting it will be doing so for essentially aesthetic reasons - and being less knowledgeable about alternatives might describe 50-80% of the audience, but to me, being less knowledgeable as a "professional" is an aesthetic choice. There's nothing wrong with this.

You can get into the weeds about what you meant by whatever you said. I am aware. But I am really saying, I'm dubious of anyone promoting "Use my new thing X which is good because it doesn't introduce a new dependency." It's an oxymoron plainly on its face. It's not in their marketing copy but the author is talking about it here, and maybe the author isn't completely sincere, maybe the author doesn't care and will happily write everything on top of RabbitMQ if someone were willing to pay for it, because that decision doesn't really matter. The author is just being reactive to people's aesthetics, that programmers on social media "like" Postgres more than RabbitMQ, for reasons, and that means you can "only" use one, but that none of those reasons are particularly well informed by experience or whatever, yet nonetheless strongly held.

When you want to explain something that doesn't make objective sense when read literally, okay, it might have an aesthetic explanation that makes more sense.


> You can get into the weeds about what you meant by whatever you said. I am aware.

>When you want to explain something that doesn't make objective sense when read literally, okay, it might have an aesthetic explanation that makes more sense.

What an attitude and way to kill a discussion. Again, hard for me to imagine that you're able to have objective discussions at work. As you wish I won't engage in discourse with you so you can feel smart.


There is some implicit context you are missing here.

Tools like Hatchet are one less dependency for projects already using Postgres, since Postgres has become the de facto database to build against.

Compare that to an application built on top of Postgres and using Celery + Redis/RabbitMQ.

Also, it seems like you are confusing aesthetic with ergonomics. Since forever, software developers have tried to improve on all of "aesthetics" (code/system structure appearance), "ergonomics" (how easy/fast is it to build with) and "performance" (how well it works), and the cycle has been continuous (we introduce extra abstractions, then do away with some when it gets overly complex, and on and on).


"Since forever, software developers have tried to improve on all of "aesthetics" (code/system structure appearance), "ergonomics" (how easy/fast is it to build with) and "performance" (how well it works), and the cycle has been continuous"

Fast, easy, well, cheap is not a quality measure, but it sure is a way to build more useless abstractions. You tell me which abstractions have made your software twice as effective.


Efficacy has more to do with the specific situation than the tools you use. Rather, it is the versatility of a tool that allows someone to take advantage of the situation.

What makes abstractions more versatile has more to do with their composability and the expressiveness of those compositions.

An abstraction that attempts to (apparently) reduce complexity without also being composable is overall less versatile. Usually, something that does one thing well is designed to also be as simple as possible. Otherwise you are increasing the overall complexity (and reducing reliability, or making it fragile instead of anti-fragile) for very little gain.


I fully agree with you.

'But I am really saying, I'm dubious of anyone promoting "Use my new thing X which is good because it doesn't introduce a new dependency."'

"Advances in software technology and increasing economic pressure have begun to break down many of the barriers to improved software productivity. The ${PRODUCT} is designed to remove the remaining barriers […]"

It reads like the above quote from the pitch of r1000 in 1985. https://datamuseum.dk/bits/30003882


And you better start critically assessing dependencies if you're a programmer. They aren't free; this is a wild take.


> You better start believing in dependencies if you’re a programmer.

Yeah, faith will be your last resort when the resulting tower of Babel fails in modes hitherto unknown to man.


Something I really like about some pub/sub systems is Push subscriptions. For example in GCP pub/sub you can have a "subscriber" that is not pulling events off the queue but instead is an http endpoint where events are pushed to.

The nice thing about this is that you can use a runtime like cloud run or lambda and allow that runtime to scale based on http requests and also scale to zero.
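For illustration, the push model reduces the consumer to an ordinary HTTP handler - a minimal sketch with FastAPI (the payload shape here is made up, not GCP's actual push envelope):

    from fastapi import FastAPI, Response
    from pydantic import BaseModel

    app = FastAPI()

    class Task(BaseModel):  # illustrative payload shape
        id: str
        name: str
        payload: dict

    @app.post("/tasks")
    def handle_task(task: Task) -> Response:
        # do the work; a non-2xx response tells the pusher to retry later
        print(f"running {task.name} ({task.id})")
        return Response(status_code=204)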

Setting up autoscaling for workers can be a little bit more finicky, e.g. in kubernetes you might set up KEDA autoscaling based on some queue depth metrics but these might need to be exported from rabbit.

I suppose you could have a setup where your daemon worker is making http requests and in that sense "push" to the place where jobs are actually running but this adds another level of complexity.

Is there any plan to support a push model where you can push jobs in over HTTP, with some daemons holding the HTTP connections open?


I like that idea, basically the first HTTP request ensures the worker gets spun up on a lambda, and the task gets picked up on the next poll when the worker is running. We already have the underlying push model for our streaming feature: https://docs.hatchet.run/home/features/streaming. Can configure this to post to an HTTP endpoint pretty easily.

The daemon feels fragile to me, why not just shut down the worker client-side after some period of inactivity?


I think it depends on the http runtime. One of the things with cloud run is that if the server is not handling requests, it doesn't get CPU time. So even if the first request is "wake up", it wouldn't get any CPU to poll outside of the request-response cycle.

You can configure Cloud Run to always allocate CPU, but it's a lot more expensive. I don't think it would be a good autoscaling story since autoscaling is based on HTTP requests being processed. (Maybe it can be done via CPU, but that may not be what you want - it may not even be CPU-bound.)


https://cloud.google.com/tasks is such a good model and I really want an open source version of it (or to finally bite the bullet and write my own).

Having http targets means you get things like rate limiting, middleware, and observability that your regular application uses, and you aren’t tied to whatever backend the task system supports.

Set up a separate scaling group and away you go.


Mergent (YC S21 - https://mergent.co) might be precisely what you're looking for in terms of a push-over-HTTP model for background jobs and crons.

You simply define a task using our API and we take care of pushing it to any HTTP endpoint, holding the connection open and using the HTTP status code to determine success/failure, whether or not we should retry, etc.

Happy to answer any questions here or over email james@mergent.co


You might want to look at https://www.inngest.com for that. Disclaimer: I'm a cofounder. We released event-driven step functions about 20 months ago.


Looks cool, but it looks like it's only TypeScript. If there is a JSON payload, couldn't any web server handle it?


We support TS, Python, Golang, and Java/Kotlin with official SDKs, and our SDK spec is open, so yes — any server can handle it :)


> For example in GCP pub/sub you can have a "subscriber" that is not pulling events off the queue but instead is an http endpoint where events are pushed to.

That just means that there's a lightweight worker that does the HTTP POST to your "subscriber". With retries etc, just like it's done here.


There are some tools like Apache NiFi which call this pattern an HTTP listener. It's also basically a kind of sink, and it sort of resembles webhook architecture.


Yep we are using cloud tasks and pub sub a lot. Another big benefit is that the GCP infra is literally “pushing” your messages even if your infra goes down.


The push queue model has major benefits, as you mentioned. We've built Hookdeck (hookdeck.com) on that premise. I hope we see more projects adopt it.


Just pointing out even though this is a "Show HN" they are, indeed, backed by YC.

Is this going to follow the "open core" pattern or will there be a different path to revenue?


Yep, we're backed by YC in the W24 batch - this is evident on our landing page [1].

We're both second time CTOs and we've been on both sides of this, as consumers of and creators of OSS. I was previously a co-founder and CTO of Porter [2], which had an open-core model. There are two risks that most companies think about in the open core model:

1. Big companies using your platform without contributing back in some way or buying a license. I think this is less of a risk, because these organizations are incentivized to buy a support license to help with maintenance, upgrades, and since we sit on a critical path, with uptime.

2. Hyperscalers folding your product in to their offering [3]. This is a bigger risk but is also a bit of a "champagne problem".

Note that smaller companies/individual developers are who we'd like to enable, not crowd out. If people would like to use our cloud offering because it reduces the headache for them, they should do so. If they just want to run our service and manage their own PostgreSQL, they should have the option to do that too.

Based on all of this, here's where we land on things:

1. Everything we've built so far has been 100% MIT licensed. We'd like to keep it that way and make money off of Hatchet Cloud. We'll likely roll out a separate enterprise support agreement for self hosting.

2. Our cloud version isn't going to run a different core engine or API server than our open source version. We'll write interfaces for all plugins to our servers and engines, so even if we have something super specific to how we've chosen to do things on the cloud version, we'll expose the options to write your own plugins on the engine and server.

3. We'd like to make self-hosting as easy to use as our cloud version. We don't want our self-hosted offering to be a second-class citizen.

Would love to hear everyone's thoughts on this.

[1] https://hatchet.run

[2] https://github.com/porter-dev/porter

[3] https://www.elastic.co/blog/why-license-change-aws


I got flagged, but I want to reiterate that you need legal means of stopping AWS from simply lifting your product wholesale. Just look at all the other companies they've turned into their own thankless premium offerings.

Put in a DAU/MAU/volume/revenue clause that pertains specifically only to hyperscalers and resellers. Don't listen to the naysayers telling you not to do it. This isn't their company or their future. They don't care if you lose your business or that you put in all of that work just for a tech giant to absorb it for free and turn it against you.

Just do it. Do it now and you won't get (astroturfed?) flack for that decision later by people who don't even have skin in the game. It's not a big deal. I would buy open core products with these protections -- it's not me you're protecting yourselves against, and I'm nowhere in the blast radius. You're trying not to die in the miasma of monolithic cloud vendors.


> path to revenue

There have to be at least 10 different ways between different cloud providers to run a distributed task queue. Amazon, Azure, GCP

Self-hosting RabbitMQ, etc.

I'm curious how they are able to convince investors that there is a sizable portion of market they think doesn't already have this solved (or already has it solved and is willing to migrate)


There will be space for improvement until every cloud has a managed offering with exactly the same interface. Like docker, postgres, S3.


I am curious to see where they differentiate themselves on observability in the longer run.

Compared to RabbitMQ, it should be easier to see what is in the queue itself without mutating it, for instance.



Sure, but to see what is in the queue you have to operate on it, mutating it. With this using Postgres, we can just look in the table.


> I'm curious how they are able to convince investors that there is a sizable portion of market they think doesn't already have this solved

Is there any task queue you are completely happy with?

I use Redis, but it’s only half of the solution.


Wasn’t the first Dropbox introduction also a show HN?

I don’t think this is out of place


I am not saying it is out of place, but I feel that for such a long-winded explanation of what they are doing, the missing "YC W24" was surprising.


How does this compare against Temporal/Cadence/Conductor? Does hatchet also support durable execution?

https://temporal.io/ https://cadenceworkflow.io/ https://conductor-oss.org/


It's very similar - I used Temporal at a previous company to run a couple million workflows per month. The gRPC networking with workers is the most similar component, I especially liked that I only had to worry about an http2 connection with mTLS instead of a different broker protocol.

Temporal is a powerful system, but we were getting to the point where it took a full-time engineer to build an observability layer around Temporal. Integrating workflows in an intuitive way with OpenTelemetry and logging was surprisingly non-trivial. We wanted to build more of a Vercel-like experience for managing workflows.

We have a section on the docs page for durable execution [1], also see the comment on HN [2]. Like I mention in that comment, we still have a long way to go before users can write a full workflow in code in the same style as a Temporal workflow; users either define the execution path ahead of time or invoke a child workflow from an existing workflow. This is also something that requires customization for each SDK - like Temporal's custom asyncio event loop in their Python SDK [3]. We don't want to roll this out until we can be sure about compatibility with the way most people write their functions.

[1] https://docs.hatchet.run/home/features/durable-execution

[2] https://news.ycombinator.com/item?id=39643881

[3] https://github.com/temporalio/sdk-python


Well, you just got a user. Love the concept of Temporal, but I can't justify the overhead you need with infra to make it work for the upper guys... And the cloud offering is a bit expensive for small companies.


Do you know about the Temporal startup program? It gives enough credits to offset support fees for 2 years. https://temporal.io/startup


I know it's gonna sound entitled. But even though we are a small company, we still process a lot of events from third parties. Temporal Cloud pricing is based on the number of actions; 2400 bucks would only cover some months in our case.


If you are expecting to still be small after 2 years that just delays the expense until you are locked in?


> we were getting to the point where it took a full-time engineer to build an observability layer around Temporal

We did it in like 5 minutes by adding in otel traces? And maybe another 15 to add their grafana dashboard?

What obstacles did you experience here?


Well, for one - most otel services (like Honeycomb) are designed around aggregate views, and engineers found it difficult to track down the failure of specific workflows. We were already using Sentry, had started adding prom + grafana into our stack, and were already using mezmo for logging. So to debug a workflow, we'd see an alert come in through Sentry, grab the workflow ID and activity ID, perform a search in the Temporal console, track down the failed activity (of which there could be between 1-100 activities), and associate that with our logs in mezmo (involving a new query syntax). This is a lot of raw data that takes time to parse and figure out what's going wrong. And then we wanted to build out a view of worker health, which involves a new set of dashboards and alerts that are different from our error alerting in Sentry.

Yes, this sounded broken to us too - we were aware of the promise of consolidation with an OpenTelemetry and Grafana stack, but we couldn't make this transition happen cleanly, and when you're already relying on certain tools for your API it makes the transition more difficult. There's also upskilling involved in getting engineers on the team to adjust to OTel when they're used to more intuitive tools like Sentry and Mezmo.

A good set of default metrics, better search, and views for worker performance and pools - that would have gone a long way. The extent of the Temporal UI's features is a basic recent-workflows list, an expanded workflow view with stack traces for thrown errors, a schedules page, and a settings page.


With NATS in the stack, what's the advantage over using NATS directly?


I'm assuming specifically you mean Nex functions? Otherwise NATS gives you connectivity and a message queue - it doesn't (or didn't) have the concept of task executions or workflows.

With regards to Nex -- it isn't fully stable and only supports Javascript/Webassembly. It's also extremely new, so I'd be curious to see how things stabilize in the coming year.


(Disclaimer: I am a NATS maintainer and work for Synadia)

The parent comment may have been referring to the fact that NATS has support for durable (and replicated) work queue streams, so those could be used directly for queuing tasks and having a set of workers dequeuing concurrently. And this is regardless of whether you would want to use Nex or not. Nex is indeed fairly new, but the team is iterating on it quickly and we are dog-fooding it internally to keep stabilizing it.

The other benefit of NATS is the built-in multi-tenancy, which would allow distinct applications/teams/contexts to have an isolated set of streams and messaging. It acts as a secure namespace.

NATS supports clustering within a region or across regions. For example, Synadia hosts a supercluster in many different regions across the globe and across the three major cloud providers. As it applies to distributed work queues, you can place work queue streams in a cluster within a region/provider closest to the users/apps enqueuing the work, and then deploy workers in the same region for optimizing latency of dequeuing and processing.

Could be worth a deeper look on how much you could leverage for this use case.


I wasn't thinking of Nex, I didn't realize Hatchet includes compute and doesn't just store tasks.

Still, it seems like NATS + any lambda implementation + a dumb service that wakes lambdas when they need to process something, would be simple to set up and in combination do the same thing.


I recently found Nex in the context of wasmCloud [0] and its ability to support long-running tasks/workflows. My impression is that Nex indeed still needs a good amount of time to mature. There was also a talk [1] about using Temporal here. For Hatchet it may be interesting to check out (note: I am not affiliated with wasmCloud, nor currently using it).

[0] https://wasmcloud.com

[1] https://www.temporal.io/replay/videos/zero-downtime-deploys-...


I need task queues where the client (web browser) can listen to the progress of the task through completion.

I love the simplicity & approachability of Deno queues for example, but I’d need to roll my own way to subscribe to task status from the client.

Wondering if perhaps the Postgres underpinnings here would make that possible.

EDIT: seems so! https://docs.hatchet.run/home/features/streaming


Yep, exactly - Gabe has also been thinking about providing per-user signed URLs to task executions so clients can subscribe more easily without a long-lived token. So basically, you would start the workflow from your API, and pass back the signed URL to the client, where we would then provide a React hook to get task updates automatically. We need this ourselves once we open our cloud instance up to self-serve, since we want to provision separate queues per user, with a Hatchet workflow of course.


Awesome to hear!


If you need to listen for the progress only, try server-sent events, maybe?: https://en.wikipedia.org/wiki/Server-sent_events

It's dead simple: the existence of the URI means the topic/channel/what-have-you exists; to access it one needs to know the URI; data is streamed but there's no access to old data; and multiple consumers are no problem.
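A minimal sketch of the idea (FastAPI here; get_progress is a stand-in for however you track task status):

    import asyncio
    import json
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse

    app = FastAPI()

    async def get_progress(task_id: str):
        # stand-in: yield progress updates until the task completes
        for pct in (10, 50, 100):
            await asyncio.sleep(1)
            yield pct

    @app.get("/tasks/{task_id}/events")
    async def task_events(task_id: str):
        async def stream():
            async for pct in get_progress(task_id):
                # SSE frames are just "data: ...\n\n" on a kept-open response
                yield f"data: {json.dumps({'task_id': task_id, 'progress': pct})}\n\n"
        return StreamingResponse(stream(), media_type="text/event-stream")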


Ah nice! I am writing a job queue this weekend for a DAG based task runner, so timing is great. I will have a look. I don't need anything too big, but I have written some stuff for using PostgreSQL (FOR UPDATE SKIP LOCKED for the win), sqlite, and in-memory, depending on what I want to use it for.

I want the task graph to run without thinking about retries, timeouts, serialized resources, etc.

Interested to look at your particular approach.


Looks pretty great! My biggest issue with Celery has been that the observability is pretty bad. Even if you use Celery Flower, it still just doesn’t give me enough insight when I’m trying to debug some problem in production.

I’m all for just using Postgres in service of the grug brain philosophy.

Will definitely be looking into this, congrats on the launch!


Appreciate it, thank you! We've spent quite a bit of time in the Celery Flower console. Admittedly it's been a while, I'm not sure if they've added views for chains/groups/etc - it was just a linear task view when I used it.

A nice thing in Celery Flower is viewing the `args, kwargs`, whereas Hatchet operates on JSON request/response bodies, so some early users have mentioned that it's hard to get visibility into the exact typing/serialization that's happening. Something for us to work on.


In case you're stuck with Celery for a while: I was hit with this same problem, and solved it by adding a sidecar HTTP server thread to the Python workers that would expose metrics written by the workers into a multithreaded registry. This has been working amazingly well in production for over two years now, and makes it really straightforward to get custom metrics out of a distributed Celery app.
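Roughly this shape (a minimal sketch, not our exact production setup; it assumes a threaded worker pool, while a prefork pool would need prometheus_client's multiprocess mode):

    from celery import Celery
    from celery.signals import worker_init
    from prometheus_client import Counter, Histogram, start_http_server

    app = Celery("tasks", broker="redis://localhost:6379/0")

    TASKS_TOTAL = Counter("app_tasks_total", "Tasks processed", ["name", "status"])
    TASK_SECONDS = Histogram("app_task_seconds", "Task duration", ["name"])

    @worker_init.connect
    def start_metrics_server(**kwargs):
        # serves /metrics on :9100 from a daemon thread inside the worker;
        # run with e.g. `celery -A tasks worker --pool=threads`
        start_http_server(9100)

    @app.task
    def send_email(to: str):
        with TASK_SECONDS.labels("send_email").time():
            ...  # actual work
        TASKS_TOTAL.labels("send_email", "success").inc()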


Any chance you could share more specifics about your solution?



Looks great! Do you publish pricing for your cloud offering? For the self hosted option, are there plans to create a Kubernetes operator? With an MIT license do you fear Amazon could create a Amazon Hatchet Service sometime in the future?


Thank you!

> Do you publish pricing for your cloud offering?

Not yet, we're rolling out the cloud offering slowly to make sure we don't experience any widespread outages. As soon as we're open for self-serve on the cloud side, we'll publish our pricing model.

> For the self hosted option, are there plans to create a Kubernetes operator?

Not at the moment, our initial plan was to help folks with a KEDA autoscaling setup based on Hatchet queue metrics, which is something I've done with Sidekiq queue depth. We'll probably wait to build a k8s operator after our existing Helm chart is relatively stable.

> With an MIT license do you fear Amazon could create a Amazon Hatchet Service sometime in the future?

Yes. The question is whether that risk is worth the tradeoff of not being MIT-licensed. There are also paths to getting integrated into AWS marketplace we'll explore longer-term. I added some thoughts here: https://news.ycombinator.com/item?id=39646788.


We're building a webhook service on FastAPI + Celery + Redis + Grafana + Loki, and the experience of setting up every service incrementally was miserable; even then it feels like logs are being dropped, and we run into reliability issues. Felt like something like this should exist already, but I couldn't find anything at the time. Really excited to see where this takes us!


That's exactly why we built Svix[1]. Building webhooks services, even with amazing tools like FastAPI, Celery and Redis is still a big pain. So we just built a product to solve it.

Hatchet looks cool nonetheless. Queues are a pain for many other use-cases too.

1: https://www.svix.com


How does this compare to River Queue (https://riverqueue.com/)? Besides the additional Python and TS client libraries.


The underlying queue is very similar. See this comment, which details how we're different from a library client: https://news.ycombinator.com/item?id=39644327. We also have the concept of workflows, which last I checked doesn't exist in River.

I'm personally very excited about River and I think it fills an important gap in the Go ecosystem! Also now that sqlc w/ pgx seems to be getting more popular, it's very easy to integrate.


Why might Hatchet be better than Windmill? Windmill uses the same approach with PostgreSQL, is very fast, and has an incredibly good UI.


One repeat issue I've had in my past position is the need to schedule an unlimited number of jobs, often months to a year from now. Example use case: a patient schedules an appointment for a follow-up in 6 months, so I schedule a series of appointment reminders in the days leading up to it. I might have millions of these jobs.

I started out by just entering a record into a database queue and polling every few seconds. Functional, but our IO costs for polling weren't ideal, and we wanted to distribute this without using stuff like ShedLock. I switched to Redis, but it got complicated dealing with multiple dispatchers, OOM issues, and having to run a secondary job to move individual tasks in and out of the immediate queue, etc. I had started looking at switching to backing it with PG and SKIP LOCKED, etc., but I've changed positions.

I can see a similar use case on my horizon wondered if Hatchet would be suitable for it.


It wouldn't be suitable for that at the moment, but might be after some refactors coming this weekend. I wrote a very quick scheduling API which pushes schedules as workflow triggers, but it's only supported on the Go SDK. It also is CPU-intensive at thousands of schedules, as the schedules are run as separate goroutines (on a dedicated `ticker` service) - I'm not proud of this. This was a pattern that made sense for the cron schedule and I just adapted it for the one-time scheduling.

Looking ahead (and back) in the database and placing an exclusive lock on the schedule is the way to do this. You basically guarantee scheduling at +/- the polling interval if your service goes down while maintaining the lock. This allows you to horizontally scale the `tickers` which are polling for the schedules.


Thanks for the follow-up! I’ll keep an eye on the progress.


Why do you need to schedule things 6 months in advance, instead of, say, checking everything that needs notifications in a rolling window (e.g. 24h ahead) and scheduling those?


Well, it was a dumbed down example. In that particular case, appointments can be added, removed, or moved at any moment, so I can’t just run one job every 24 hours to tee up the next day’s work and leave it at that. Simply polling the database for messages that are due to go out gives me my just-in-time queue, but then I need to build out the work to distribute it, and we didn’t like the IO costs.

I did end up moving it to Redis and basically ZADD an execution timestamp and job ID, then ZRANGEBYSCORE at my desired interval and remove those jobs as I successfully distribute them out to workers. I then set a fence time. At that time a job runs to move stuff that should have run but didn't (rare, thankfully) into a remediation queue, and load the next block of items that should run between now + fence. At the service level, any item with a scheduled date within the fence gets ZADDed after being inserted into the normal database. Anything outside the fence will be picked up at the appropriate time.
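Roughly this pattern (a sketch with redis-py; dispatch is a stand-in for handing the job to a worker, and the fence/remediation parts are omitted):

    import time
    import redis  # redis-py

    r = redis.Redis(decode_responses=True)

    def schedule(job_id: str, run_at: float) -> None:
        # score = unix timestamp at which the job should run
        r.zadd("scheduled_jobs", {job_id: run_at})

    def dispatch(job_id: str) -> None:
        print("dispatching", job_id)  # stand-in for handing off to a worker

    def poll_once() -> None:
        due = r.zrangebyscore("scheduled_jobs", "-inf", time.time())
        for job_id in due:
            # zrem returns 1 only for the poller that actually removed it,
            # so concurrent pollers don't double-dispatch
            if r.zrem("scheduled_jobs", job_id):
                dispatch(job_id)

    while True:
        poll_once()
        time.sleep(1)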

This worked. I was able to ramp up the polling to get near-real-time dispatch while also noticeably reducing costs. Problems were some occasional Redis issues (OOM, and having to either keep bumping up the Redis instance size or reduce the fence duration), allowing multiple pollers for redundancy and scale (I used ShedLock for that :/), and occasionally a bug where the poller craps out in the middle of the Redis work, resulting in an at-least-once SLA which required downstream protections to make sure I don't send the same message multiple times to the patient.

Again, it all works but I’m interested in seeing if there are solutions that I don’t have to hand roll.


I built https://www.inngest.com specifically because of healthcare flows. You should check it out, with the obvious disclaimer that I'm biased. Here's what you need:

1. Functions which allow you to declaratively sleep until a specific time, automatically rescheduling jobs (https://www.inngest.com/docs/reference/functions/step-sleep-...).

2. Declarative cancellation, which allows you to cancel jobs if the user reschedules their appointment automatically (https://www.inngest.com/docs/guides/cancel-running-functions).

3. General reliability and API access.

Inngest does that for you, but again — disclaimer, I made it and am biased.


Couldn't you just enqueue + change a status, then check before firing? I don't see why you'd need more than a dumb queue and a DB table for that, unless you're doing millions of QPS.


Can you explain why this cannot be a simple daily cronjob that queries for appointments upcoming in the next <time window> and sends out notifications at that time? Polling every few seconds seems like overkill.



Related, I also wrote my own distributed task queue in Python [0] and TypeScript [1] with a Show HN [2]. Time it took was about a week. I like your features, but it was easy to write my own so I'm curious how you're building a money making business around an open source product. Maybe the fact everyone writes their own means there's no best solution now, so you're trying to be that and do paid closed source features for revenue?

[0] https://github.com/wakatime/wakaq

[1] https://github.com/wakatime/wakaq-ts

[2] https://news.ycombinator.com/item?id=32730038


Nice, Waka looks cool! I've talked a bit about the tradeoffs with library-mode pollers, for example here: https://news.ycombinator.com/item?id=39644327. Which isn't to say they don't make sense, but scaling wise I think there can be some drawbacks.

> I'm curious how you're building a money making business around an open source product.

We'd like to make money off of our cloud version. See the comment on pricing here - https://news.ycombinator.com/item?id=39653084 - which also links to other comments about pricing, sorry about that.


Thanks. There's definitely a need for this, hence why I built WakaQ. Most distributed task queues have bugs or lack features and are overly complex. Would have been nice to find one I could have used instead of building my own. To be transparent, had Hatchet been around I probably would have self-hosted unless your cloud pricing gave similar throughput for the price I get on DigitalOcean. I'm unique, as a bootstrapped solo company. Maybe Hatchet can be the right solution for others. Keep the momentum going!


What specific strategies does Hatchet employ to guarantee fault tolerance and enable durable execution? How does it handle partial failures in multi-step workflows?


Each task in Hatchet is backed by a workflow [1]. Workflows are predefined steps which are persisted in PostgreSQL. If a worker dies or crashes midway through (stops heartbeating to the engine), we reassign tasks (assuming they have retries left). We also track timeouts in the database, which means if we miss a timeout, we simply retry after some amount of time. Like I mentioned in the post, we avoid some classes of faults just by relying on PostgreSQL and persisting each workflow run, so you don't need to time out with distributed locks in Redis, for example, or worry about data loss if Redis OOMs. Our `ticker` service is basically its own worker which is assigned a lease for each step run.
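To illustrate the general shape of that reassignment (a sketch against a hypothetical schema for illustration only, not our actual tables or code):

    import psycopg  # psycopg 3; hypothetical schema for illustration only

    REQUEUE = """
    UPDATE task_runs
    SET status = 'queued', worker_id = NULL, retries_left = retries_left - 1
    WHERE status = 'running'
      AND retries_left > 0
      AND worker_id IN (
          SELECT id FROM workers
          WHERE last_heartbeat < now() - interval '60 seconds'
      )
    """

    def requeue_stale(conn: psycopg.Connection) -> int:
        with conn.transaction():
            return conn.execute(REQUEUE).rowcount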

We also store the input/output of each workflow step in the database. So resuming a multi-step workflow is pretty simple - we just replay the step with the same input.

To zoom out a bit - unlike many alternatives [2], the execution path of a multi-step workflow in Hatchet is declared ahead of time. There are tradeoffs to this approach; it makes it much easier to run a single-step workflow or a workflow whose execution path you know ahead of time. You also avoid classes of problems related to workflow versioning: we can gracefully drain older workflow versions with a different execution path. It's also more natural to debug and see a DAG execution instead of debugging procedural logic.

The clear tradeoff is that you can't try...catch the execution of a single task or concatenate a bunch of futures that you wait for later. Roadmap-wise, we're considering adding procedural execution on top of our workflows concept. Which means providing a nice API for calling `await workflow.run` and capturing errors. These would be a higher-level concept in Hatchet and are not built yet.

There are some interesting concepts around using semaphores and durable leases that are relevant here, which we're exploring [3].

[1] https://docs.hatchet.run/home/basics/workflows

[2] https://temporal.io

[3] https://www.citusdata.com/blog/2016/08/12/state-machines-to-...


I think the answer is no but just to be sure: are you able to trigger step executions programmatically from within a step, even if you can't await their results?

Related, but separately: can you trigger a variable number of task executions from one step? If the answer to the previous question is yes, then it would of course be trivial; if not, I'm wondering if you could, e.g., have a task act as a generator and yield values, or just return a list, and have each individual item get passed off to its own execution of the next task(s) in the DAG.

For example some of the examples involve a load_docs step, but all loaded docs seem to be passed to the next step execution in the DAG together, unless I'm just misunderstanding something. How could we tweak such an example to have a separate task execution per document loaded? The benefits of durable execution and being able to resume an intensive workflow without repeating work is lessened if you can't naturally/easily control the size of the unit of work for task executions.


You can execute a new workflow programmatically, for example see [1]. So people have triggered, for example, 50 child workflows from a parent step. As you've identified, the difficult part there is the "collect" or "gather" step; we've had people hack around that by waiting for all the steps from a second workflow (and falling back to the list-events method to get status), but this isn't an approach I'd recommend and it's not well documented. And there's no circuit breaker.

> I'm wondering if you could i.e. have a task act as a generator and yield values, or just return a list, and have each individual item get passed off to its own execution of the next task(s) in the DAG.

Yeah, we were having a conversation yesterday about this - there's probably a simple decorator we could add so that if a step returns an array, and a child step is dependent on that parent step, it fans out if a `fanout` key is set. If we can avoid unstructured trace diagrams in favor of a nice DAG-style workflow execution we'd prefer to support that.

The other thing we've started on is propagating a single "flow id" to each child workflow so we can provide the same visualization/tracing that we provide in each workflow execution. This is similar to AWS X-Ray.

As I mentioned we're working on the durable workflow model, and we'll find a way to make child workflows durable in the same way activities (and child workflows) are durable on Temporal.

[1] https://docs.hatchet.run/sdks/typescript-sdk/api/admin-clien...


What happens if a worker goes silent for longer than the heartbeat duration, then a new worker is spawned, then the original worker “comes back to life”? For example, because there was a network partition, or because the first worker’s host machine was sleeping, or even just that the first worker process was CPU starved?


The heartbeat duration (5s) is not the same as the inactive duration (60s). If a worker has been down for 60 seconds, we reassign to provide some buffer and handle unstable networks. Once someone asks we'll expose these options and make them configurable.

We currently send cancellation signals for individual tasks to workers, but our cancellation signals aren't replayed if they fail on the network. This is an important edge case for us to figure out.

There's not much we can do if the worker ignores that signal. We should probably add some alerting if we see multiple responses on the same task, because that means the worker is ignoring the cancellation signal. This would also be a problem if workloads start blocking the whole thread.


Right, I meant inactive duration, of course.

Cancellation signals are tricky. You of course cannot be sure that the remote end receives it. This turns into the two generals problem.

Yes, you need monitoring for this case. I work on scientific workloads which can completely consume CPU resources. This failure scenario is quite real.

Not all tasks are idempotent, but it sounds like a prudent user should try to design things that way, since your system has “at least once” execution of tasks, as opposed to “at most once.” Despite any marketing claims, “exactly once” is not generally possible.

Good docs on this point are important, as is configurability for cases when “at most once” is preferable.


Thank you for the thorough response!


Latency is really important, and that is honestly why we rewrote most of this stack ourselves, but the project with the guarantee of <25ms looks interesting. I wish there were an "instant" mode where, if enough workers are available, it could just do direct placement.


To be clear, the 25ms isn't a guarantee. We have a load testing CLI [1] and the secondary steps on multi-step workflows are in the range of 25ms, while the first steps are in the range of 50ms, so that's what I'm referencing.

There's still a lot of work to do on optimization though, particularly to improve the polling interval if there aren't workers available to run the task. Some people might expect to set a max concurrency limit of 1 on each worker and have each subsequent workflow take 50ms to start, which isn't the case at the moment.

[1] https://github.com/hatchet-dev/hatchet/tree/main/examples/lo...


How is this different from pg-boss[1]? Other than the distributed part it also seems to use skip locked.

[1] https://github.com/timgit/pg-boss


I haven't used pg-boss, and feature-wise it looks very similar and is an impressive project.

The core difference is that pg-boss is a library while Hatchet is a separate service which runs independently of your workers. This service also provides a UI and API for interacting with Hatchet - I don't think pg-boss has those things, so you'd probably have to build out observability yourself.

This doesn't make a huge difference when you're at 1 worker, but having each worker poll your database can lead to DB issues if you're not careful - I've seen some pretty low-throughput setups for very long-running jobs using a database with 60 CPUs because of polling workers. Hatchet distributes in two layers - the "engine" and the "worker" layer. Each engine polls the database and fans out to the workers over a long-lived gRPC connection. This reduces pressure on the DB and lets us manage which workers to assign tasks to based on things like max concurrent runs on each worker or worker health.


Can you explain why you chose every function to take in context? https://github.com/hatchet-dev/hatchet/blob/main/python-sdk/...

This seems like a lot of boilerplate to write functions with, to me (context: I created http://github.com/DAGWorks-Inc/hamilton).


We did it because there are methods that should be accessed which don't map to `args` cleanly. For example, we let users call `context.log`, `context.done` (to determine whether to return on cancellation) or `context.step_output` (to dynamically access a parent's step output). Perhaps there's a more pythonic way to do this? Admittedly this is a pattern we adapted from Go.


Probably just have it attached to self, like self.context

But nbd IMHO


yep nbd, but :/


You could just make them optional arguments that you inject if they're declared. Happy to chat more. With Hamilton we could actually build an alternative way to describe your API pretty easily...
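A sketch of that optional-injection idea in plain Python (not Hatchet's actual API):

    import inspect
    from typing import Any, Callable

    def invoke_step(fn: Callable[..., Any], args: dict, context: Any) -> Any:
        # pass the context only to steps that declare a 'context' parameter
        if "context" in inspect.signature(fn).parameters:
            return fn(**args, context=context)
        return fn(**args)

    # a step that wants the context declares it...
    def fetch_docs(url: str, context) -> list:
        context.log(f"fetching {url}")
        return []

    # ...and one that doesn't can stay a plain function
    def summarize(docs: list) -> str:
        return f"{len(docs)} docs"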


Wow, looks great! We currently happily use graphile-worker, and have two questions:

> full transactional enqueueing

Do you mean transactional within the same transaction as the application's own state?

My guess is no (from looking at the docs, where enqueuing in the SDK looks a lot like a wire call and not issuing a SQL command over our application's existing connection pool), and that you mean transactionality between steps within the Hatchet jobs...

I get that, but fwiw transactionality of "perform business logic against entities + job enqueue" (both for queuing the job itself, as well as work performed by workers) is the primary reason we're using a PG-based job queue, as then we avoid transactional outboxes for each queue/work step.

So, dunno, losing that would be a big deal / kinda defeat the purpose (for us) of a PG-based queue.

2nd question, not to be a downer, but I'm just genuinely curious as a wanna-be dev infra/tooling engineer: a) why take funding to build this (it seems bootstrappable? maybe that's naive), and b) why would YC keep putting money into these "look really neat but ...surely?... will never be the 100x returns/billion dollar companies" dev infra startups? Or maybe I'm over-estimating the size of the return/exit necessary to make it worth their while.


Yeah these are great questions.

> Do you mean transactional within the same transaction as the application's own state? My guess is no (from looking at the docs, where enqueuing in the SDK looks a lot like a wire call and not issuing a SQL command over our application's existing connection pool), and that you mean transactionality between steps within the Hatchet jobs...

Yeah, it's the latter, though we'd actually like to support the former in the longer term. There's no technical reason we can't write the workflow/task and read from the same table that you're enqueueing with in the same transaction as your application. That's the really exciting thing about the RiverQueue implementation, though it also illustrates how difficult it is to support every PG driver in an elegant way.

Transactional enqueueing is important for a whole bunch of other reasons though - like assigning workers, maintaining dependencies between tasks, implementing timeouts.

> why take funding to build this (it seems bootstrappable? maybe that's naive)

The thesis is that we can help some users offload their tasks infra with a hosted version, and hosted infra is hard to bootstrap.

> why would YC keeping putting money into these "look really neat but ...surely?... will never be the 100x returns/billion dollar companies" dev infra startups?

I think Cloudflare is an interesting example here. You could probably make similar arguments against a spam protection proxy, which was the initial service. But a lot of the core infrastructure needed for that service branches into a lot of other products, like a CDN or caching layer, or a more compelling, full-featured product like a WAF. I can't speak for YC or the other dev infra startups, but I imagine that's part of the thesis.


Nice! Thanks for the reply!

> hosted infra is hard to bootstrap.

Ah yeah, that definitely makes sense...

> a lot of the core infrastructure needed for that service branches into a lot of other products

Ah, I think I see what you mean--the goal isn't to be "just a job queue" in 2-5 years, it's to grow into a wider platform/ecosystem/etc.

Ngl I go back/forth between rooting for dev-founded VC companies like yourself, or Benjie, the guy behind graphile-worker, who is tip-toeing into being commercially-supported.

Like I want both to win (paydays all around! :-D), but the VC money just gives such a huge edge, of establishing a very high bar of polish / UX / docs / devrel / marketing, basically using loss-leading VC money for a bet that may/may not work out, that it's very hard for the independents to compete. I have honestly been hoping post-ZIRP would swing some advantage back to the Benjie's of the world, but looks like no/not yet.

...I say all of above ^ while myself working for a VC-backed prop-tech company...so, kinda the pot calling the kettle black. :-D

Good luck! The fairness/priority queues of Hatchet definitely seem novel, at least from what I'm used to, so will keep it bookmarked/in the tool chest.


A related lively discussion from a few months ago: https://news.ycombinator.com/item?id=37636841

Long live Postgres queues.


I've been looking for this exact thing for awhile now. I'm just starting to dig into the docs and examples, and I have a question on workflows.

I have an existing pipeline that runs tasks across two K8s clusters that share a DB. Is it possible to define steps in a workflow where the step's run logic is set up to run elsewhere? Essentially not having an inline run function defined, and instead having another worker process listening for that step name.


This depends on the SDK - both the TypeScript and Golang SDKs support a `registerAction` method on the worker which basically lets you register a single step to run only on that worker. You would then call `putWorkflow` programmatically before starting the worker. Steps are distributed by default, so they run on the workers which have registered them. Happy to provide a more concrete example for the language you're using.


Perfect. Yeah, we're using both, but mainly TS. We'll test that out.


The website for Hatchet and the GitHub repository make it look like a compelling distributed task queue solution. I see from the main website that this appears to have commercial aspirations, but I don’t see any pricing information available. Do you have a pricing model yet? I’d be apprehensive to consider using Hatchet in future projects without knowing how much it costs.


We'd like to make money off Hatchet Cloud, which is in early access - some more on that here [1] and here [2]. Pricing will be transparent once we're open access.

Like I mention in that comment, we'd like to keep our repository 100% MIT licensed. I realize this is unpopular among open source startups - and I'm sure there are good reasons for that. We've considered these reasons and still landed on the MIT license.

[1] https://news.ycombinator.com/item?id=39647101

[2] https://news.ycombinator.com/item?id=39646788


It’s been about a dozen years since I heard someone assert that some CI/CD services were the most reliable task scheduling software for periodic tasks (far better than cron). Shouldn’t the scheduling be factored out as a separate library?

I found that shocking at the time, if plausible, and wondered why nobody pulled on that thread. I suppose like me they had bigger fish to fry.


This reminds me of: https://news.ycombinator.com/item?id=28234057

If you're saying that the scheduling in Hatchet should be a separate library, we rely on go-cron [1] to run cron schedules.

[1] https://github.com/go-co-op/gocron
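
For reference, a minimal gocron usage sketch (v1-style API - check the repo for the exact current API, since it has changed across major versions):

    package main

    import (
        "log"
        "time"

        "github.com/go-co-op/gocron"
    )

    func main() {
        s := gocron.NewScheduler(time.UTC)

        // Standard 5-field cron expression: run every 15 minutes.
        _, err := s.Cron("*/15 * * * *").Do(func() {
            log.Println("running scheduled job")
        })
        if err != nil {
            log.Fatal(err)
        }

        s.StartBlocking() // run the scheduler in the foreground
    }

The sketch above is just the library on its own, not how it's wired into Hatchet.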


Honestly, I'm doing something like that right now, just not in a position to show.

All I want is a simple way to specify a tree of jobs to run to do things like check out a git branch, build it, run the tests, then install the artifacts.

Or push a new static website to some site. Or periodically do something.

My grug brain simply doesn't want to deal with modern way of doing $SHIT. I don't need to manage a million different tasks per hour, so scaling vertically is acceptable to me, and the benefits of scaling horizontally simply don't appear in my use cases.


I'm curious if this supports coroutines as tasks in Python. It's especially useful for genAI, and legacy queues (namely Celery) are lacking in this regard.

It would help to see a mapping of Celery to Hatchet as examples. The current examples require you to understand (and buy into) Hatchet's model, but that's hard to do without understanding how it compares to existing solutions.


Ola, fellow YC founders. Surely you have seen Windmill, since you refer to it in the comments below. It looks like Hatchet, being a lot more recent, currently has a subset of what Windmill offers, albeit with a focus solely on the task queue and without the self-hosted enterprise focus. So it looks more like a competitor to Inngest than to Windmill. We released workflows as code last week, which had been the primary differentiator between other workflow engines and us so far: https://www.windmill.dev/docs/core_concepts/workflows_as_cod...

The license is more permissive than ours (MIT vs AGPLv3), and you're using Go vs Rust for us, but other than that the architecture looks extremely similar, also based mostly on Postgres with the same insight as us: it's sufficient. I'm curious where you see the main differentiator long-term.


No connection to either company, but for what it’s worth I’d never in a million years consider Windmill and this product to be direct competitors.

We’ve had a lot of pain with celery and Redis over the years and Hatchet seems to be a pretty compelling alternative. I’d want to see the codebase stabilize a bit before seriously considering it though. And frankly I don’t see a viable path to real commercialization for them so I’d only consider it if everything you needed really was MIT licensed.

Windmill is super interesting but I view it as the next evolution of something like Zapier. Having a large corpus of templates and integrations is the power of that type of product. I understand that under the hood it is a similar paradigm, but the market positioning is rightfully night and day. And I also do see a path to real commercialization of the Windmill product because of the above.


Windmill is used by large enterprises to run critical jobs, written in code, that require a predefined amount of resources, can run for months if needed, and stream their logs - all at scale, with the utmost reliability, high throughput and the lowest overhead. The only insight shared with Zapier is how easy it is to develop new workflows.

I understand our positioning is not clear on our landing page (and we are working on it), but my read of Hatchet is that what they put forward is mostly a durable execution engine for arbitrary code in Python/TypeScript on a fleet of managed workers, which is exactly what Windmill is. We are profitable, and probably wouldn't be if we were MIT licensed with no enterprise features.

From reading their documentation, the implementation is extremely similar: you define workflows as code ahead of time, and then the engine makes sure they progress reliably on your fleet of workers (one of our customers has 600 workers deployed on edge environments). There are a few minor differences: we implement the workers as a generic Rust binary that pulls the workflows, so you never have to redeploy them to test and deploy new workflows, whereas they have developed SDKs for each language to allow you to define your own deployable workers (which is more similar to Inngest/Temporal). Also, we use polling and REST instead of gRPC for communication between workers and servers.


Why not use Postgres LISTEN/NOTIFY instead of RabbitMQ pub/sub?


When I started on this codebase, we needed to implement some custom exchange logic that maps very neatly to fanout exchanges and non-durable queues in RabbitMQ - features we hadn't built out on our PostgreSQL layer yet. This was a bootstrapping problem. Like I mentioned in the comment, we'd like to switch to a pub/sub pattern that lets us distribute our engine over multiple geographies. Listen/notify could be the answer once we migrate to PG 16, though there are some concerns around connection poolers like pg_bouncer having limited support for listen/notify. There's a Github discussion on this if you're curious: https://github.com/hatchet-dev/hatchet/discussions/224.
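
For anyone curious what the listen/notify pattern looks like on its own, here's a bare-bones sketch with pgx (a generic example, not our engine code). LISTEN is session-scoped, which is exactly why transaction-pooling proxies like pg_bouncer complicate it:

    package pubsub

    import (
        "context"
        "log"

        "github.com/jackc/pgx/v5"
    )

    // listen holds a dedicated connection open and logs notifications.
    func listen(ctx context.Context, dsn string) error {
        conn, err := pgx.Connect(ctx, dsn)
        if err != nil {
            return err
        }
        defer conn.Close(ctx)

        if _, err := conn.Exec(ctx, "LISTEN task_events"); err != nil {
            return err
        }
        for {
            // Blocks until a NOTIFY arrives or the context is cancelled.
            n, err := conn.WaitForNotification(ctx)
            if err != nil {
                return err
            }
            log.Printf("channel=%s payload=%s", n.Channel, n.Payload)
        }
    }

Publishers then fire events with `NOTIFY task_events, 'task-123'` or `SELECT pg_notify('task_events', 'task-123')` from any connection (notifications issued inside a transaction are only delivered when it commits).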


I use HAProxy with listen/notify via one of the Go libs. It works as long as the connection is up - i.e. I have a timeout of 30 min configured in HAProxy, after which you have to assume you lost sync and recheck. That's not that bad every 30 min... at least for me. You can also configure it to never close...



I see... apparently it uses both


Any plans for SDKs outside the current three? .NET Core & Java would be interesting to see..


Not at the moment - the biggest ask has been Rails, but on the other hand Sidekiq is so beloved that I'm not sure it makes sense at the moment. We have our hands very full with the 3 SDKs, though I'd love for us to support a community-backed SDK. If anyone's interested in working on that, feel free to message us in the Discord.


Congrats on the launch!

You say Celery can use Redis or RabbitMQ as a backend, but I've also used it with Postgres as a broker successfully, although on a smaller scale (just a single DB node). It's undocumented, so I definitely wouldn't recommend anybody use it in production now, but it still seems to work fine. [1]

How does Hatchet compare to this setup? Also, have you considered making a plugin backend for Celery, so that old systems can be ported more easily?

[1]: https://stackoverflow.com/a/47604045/1593459


I’m interested in self hosting this. What’s the recommendation here for state persistence and self healing? Wish there was a guide for a small team who wants to self host before trying managed cloud


I think we might have had a dead link in the README to our self-hosting guide, here it is: https://docs.hatchet.run/self-hosting.

The component which needs the highest uptime is our ingestion service [1]. This ingests events from the Hatchet SDKs and is responsible for writing the workflow execution path, and then sends messages downstream to our other engine components. This is a horizontally scalable service and you should run at least 2 replicas across different AZs. Also see how to configure different services for engine components [2].

The other piece of this is PostgreSQL - use your favorite managed provider which has point-in-time restores and backups. This is the core of our self-healing; I'm not sure where it makes sense to route writes if the primary goes down.

Let me know what you need for self-hosted docs, happy to write them up for you.

[1] https://github.com/hatchet-dev/hatchet/tree/main/internal/se... [2] https://docs.hatchet.run/self-hosting/configuration-options#...


You've explained your value proposition vs. celery, but I'm curious if you also see Hatchet as an alternative to Nextflow/Snakemake which are commonly used in bioinformatics.


> Distributed

> Built on PostGRES

Not what people usually mean by distributed, caveat emptor


Citus Data would like a word.


I love this idea. I wish it existed a few years ago when I did a not so good job of implementing a distributed DAG processing system :D

Looking forward to trying it out!


In https://docs.hatchet.run/home/quickstart/installation, it says

> Welcome to Hatchet! This guide walks you through getting set up on Hatchet Cloud. If you'd like to self-host Hatchet, please see the self-hosted quickstart instead.

but the link to "self-hosted quickstart" links back to the same page


This should be fixed now, here's the direct link: https://docs.hatchet.run/self-hosting.


Does it (or will it, i.e. is it planned to) support delayed execution? E.g. I have a task that I want to run at a certain time in the future.


It is planned - see here for more details: https://news.ycombinator.com/item?id=39646300.

We still need to do some work on this feature though, we'll make sure to document it when it's well-supported.


Looks very promising. Recently, I built an asynchronous DAG executor in Python, and I always felt I was reinventing the wheel, but when looking for a resilient and distributed DAG executor, nothing really met the requirements. The feature set is appealing. Wondering if adding/removing/skipping nodes in the DAG dynamically at runtime is possible.


a little late now, but I wonder if https://github.com/DataBiosphere/toil might meet your requirements


It's something interesting, I'll have a closer look, thanks.


Been following since Hatchet was an OSS TFC alternative. Seems like you guys pivoted. Curious to learn why and how you moved from the earlier value prop to this one?


Since these are task executions in a DAG, to what degree does it compete with dagster or airflow? I get that I can’t define the task with Hatchet, but if I already want to separate my DAG from my tasks, is this a viable option?


It can be used as an alternative to dagster or airflow but doesn't have the prebuilt connectors that airflow offers. And yes, there are ways to reuse tasks across workflows, but the docs for that aren't quite there yet. The key is to call a `registerAction` method and create the workflow programmatically - but we have some work to do before we publicize this pattern (for one, removing the overloading of the terms action, function, step and task).

We'll be posting updates and announcements in the Discord - and the Github in our releases - I'd expect that we document this pattern pretty soon.


I wish that this was just an SDK built on top of a provider/standard. AMQP 1.0 is a standard protocol. You can build all this without being tied to a product or to RabbitMQ, with a storage provider and an AMQP protocol layer.


You say this is for generative AI. How do you distribute inference across workers? Can one use just any protocol and how does this work together with the queue and fault tolerance?

Could not find any specifics on generative AI in your docs. Thanks


This isn't built specifically for generative AI, but generative AI apps typically have architectural issues that are solved by a good queueing system and worker pool. This is particularly true once you start integrating smaller, self-hosted LLMs or other types of models into your pipeline.

> How do you distribute inference across workers?

In Hatchet, "run inference" would be a task. By default, tasks get randomly assigned to workers in a FIFO fashion. But we give you a few options for controlling how tasks get ordered and sent. For example, let's say you'd like to limit users to 1 inference task at a time per session. You could do this by setting a concurrency key "<session-id>" and `maxRuns=1` [1]. This means that for each session key, you only run 1 inference task. The purpose of this would be fairness.

> Can one use just any protocol

We handle the communication between the worker and the queue through a gRPC connection. We assume that you're passing JSON-serializable objects through the queue.

[1] https://docs.hatchet.run/home/features/concurrency/round-rob...


Got it, so the underlying infrastructure (the inference nodes, if you wish) would be something to be solved outside of Hatchet, but it would then allow you to schedule inference tasks per user with limits.


From your experience, what would be a good way of doing Postgres master-master replication? My understanding is that Postgres Professional/EnterpriseDB-based solutions provide reliable M-M, but those are proprietary.


> Hatchet is built on a low-latency queue (25ms average start)

That seems pretty long - am I misunderstanding something? By my understanding this means the time from enqueue to job processing, maybe someone can enlighten me.


To clarify - you're right, this is a long time in a message/event queue.

It's not an eternity in a task queue which supports DAG-style workflows with concurrency limits and fairness strategies. The reason for this is you need to check all of the subscribed workers and assign a task in a transactional way.

The limit on the Postgres level is probably on the order of 5-10ms on a managed PG provider. Have a look at: https://news.ycombinator.com/item?id=39593384.

Also, these are not my benchmarks, but have a look at [1] for Temporal timings.

[1] https://www.windmill.dev/blog/launch-week-1/fastest-workflow...


It's only a few billion instructions on a decent sized server these days


Damn, I want one of these 100GHz CPUs you have, that sounds great.

I think you mean million :)


You'd be surprised. 1 billion instructions in 25ms is realistic these days.

My laptop can execute about 400 billion CPU instructions per second on battery.

That's about 10 billion instructions in 25ms.

That's the CPU alone, i.e. not including the GPU, which would increase the total considerably. Also not counting SIMD lanes as separate: the count is bona fide assembly language instructions.

It comes from cores running at ~4GHz, 8 issued instructions per clock, times 12 cores, plus 4 additional "efficiency" cores adding a bit more. People have confirmed by measurement that 8 instructions per clock is achievable (or close to it) in well-optimised code. Average code is more like 2-3 per cycle.

Only for short periods as the CPU is likely to get hot and thermally throttle even with its fan. But when it throttles it'll still exceed 1 billion in 25ms.

For perspective on how far silicon has come, the GPU on my laptop is reported to do about 14 trillion floating-point 32-bit calculations per second.


My iPad has 8 cores, each executing about 4 to 6 billion instructions a second these days (3GHz at an IPC of about two at most).


Have you considered https://github.com/tembo-io/pgmq for the queue bit?


This is not a viable product, it's a feature


How does this compare to ZeroMQ (ZMQ) ?

https://zeromq.org/


Not the OP or familiar with Hatchet, but generally ZeroMQ is a bit lower down in the stack -- it's something you'd build a distributed task queue or protocol on top of, but not something you'd usually reach for if you needed one for a web service or similar unless you had very special requirements and a specific, careful design in mind.

This tool comes with more bells and whistles and presumably will be more constrained in what you can do with it, whereas ZeroMQ gives you the flexibility to build your own protocol. In principle they have many of the same use cases, like how you can buy ready-made whipped cream or whip up your own with some heavy cream and sugar -- one approach is more constrained but works for most situations where you need some whipped cream, and the other is a lot more work and somewhat higher risk (you can over-whip your cream and end up with butter), but you can do a lot more with it.


ZeroMQ is a library that implements an application layer network protocol. Hatchet is a distributed job server with durability and transaction semantics. Two completely different things at very different levels of the stack. ZeroMQ supports fan-out messaging and other messaging patterns that could maybe be used as part of a job server, but it doesn't have anything to say about durability, retries, or other concerns that job servers take care of, much less a user interface.


How is this different from Cadence by Uber or SWF?


Seems like this summary should be in the README


Hey @abelanger,

I got a few feature requests for Pueue that were out of scope, as they didn't fit Pueue's vision, but they seem to fit Hatchet quite well (e.g. complex scheduling functionality and multi-agent support) :)

One thing I'm missing from your website, however, is an actual view of the interface - what does the actual user interface look like?

Having the possibility to schedule stuff in a smart way is nice and all, but how do you *overlook* it? It's important to get a good overview of how your tasks perform.

Once I'm convinced that this is actually a useful piece of software, I would like to reference you in the Readme of Pueue as an alternative for users that need more powerful scheduling features (or multi-client support) :) Would that be OK with you?


Pueue looks cool, it's not an alternative to Hatchet though - looks like it's meant to be run in the terminal or by a user? We're very much meant to run in an application runtime.

Like I mentioned here [1], we'll expand our comparison section over time. If Pueue's an alternative people are asking about, we'll definitely put it in there.

> Having the possibility to schedule stuff in a smart way is nice and all, but how do you overlook it? It's important to get a good overview of how your tasks perform.

I'm not sure what you mean by this. Perhaps you're referring to this - https://news.ycombinator.com/item?id=39647154 - in which case I'd say: most software is far from perfect. Our scheduling works but has limitations and is being refactored before we advertise it and build it into our other SDKs.

[1] https://news.ycombinator.com/item?id=39643631


One of my favourite spaces, and the presentation in the README is clear and immediately told me what it is, along with most of the key information that I usually complain is missing.

However, I am still missing a section on why this is different from the other existing and more mature solutions. What led you to develop this over existing options, and what different tradeoffs did you make? Extra points if you can concisely tell me what you do badly that your 'competitors' do well, because I don't believe there is one best solution in this space - it is all tradeoffs.


Sorry, I am dumb and commented after clicking on the link. I would just add your HN text to the README, as that is exactly what I was looking for.


Done [1]. We'll expand this section over time. There are also definite tradeoffs to our architecture - I spoke to someone wanting the equivalent of 1.5M PutRecords/s in Kinesis, which we're definitely not ready for because we persist every event + task execution in Postgres.

[1] https://github.com/hatchet-dev/hatchet/blob/main/README.md#h...


My only question is why did you call it Hatchet if it doesn't cut down on your logs?

I'll show myself out.


Really have an axe to grind with this comment...


Exciting time for distributed, transactional task queue projects built on the top of PostgreSQL!

Here are the most heavily upvoted in the past 12 months:

Hatchet https://news.ycombinator.com/item?id=39643136

Inngest https://news.ycombinator.com/item?id=36403014

Windmill https://news.ycombinator.com/item?id=35920082

HN comments on Temporal.io https://github.com/temporalio https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

Internally we rant about the complexity of the above projects vs using transactional job queue libs like:

river https://news.ycombinator.com/item?id=38349716

neoq: https://github.com/acaloiaro/neoq

gue: https://github.com/vgarvardt/gue

Deep inside, I can't wait to see someone like ThePrimeTimeagen review it ;) https://www.youtube.com/@ThePrimeTimeagen



