Launch HN: Batch (YC S20) – Replays for event-driven systems
154 points by dsies on Aug 17, 2020 | hide | past | favorite | 58 comments
Hello HN!

We are Ustin and Daniel, co-founders of Batch (https://batch.sh) - an event replay platform. You can think of us as version control for data passing through your messaging systems. With Batch, a company is able to go back in time, see what its data looked like at a certain point and, if it makes sense, replay that piece of data back into the company's systems.

This idea was born out of getting annoyed by what an unwieldy black box Kafka is. While many folks use Kafka for streaming, an equal number use it as a traditional messaging system. Historically, these systems have offered very poor visibility into what's going on inside them and (at best) a poor replay experience. This problem is prevalent across pretty much every messaging system. And if the messages on the bus are serialized, it is almost guaranteed that you will have to write custom, one-off scripts when working with them.

This "visibility" pain point is exacerbated tenfold if you are working with event-driven architectures and/or event sourcing - you must have a way to search and replay events, as you will need to rebuild state in order to bring up new data stores and services. That may sound straightforward, but it's actually really involved. You have to figure out how and where to store your events, how to serialize them, search them, play them back, and how/when/if to prune, delete or archive them.

Rather than spending a ton of money on building such a replay platform in-house, we decided to build a generic one and hopefully save everyone a bunch of time and money. We are 100% believers in "buy" (vs "build") - companies should focus on building their core product and not waste time on sidequests. We've worked on these systems before at our previous gigs and decided to put our combined experience into building Batch.

A friend of mine shared this bit of insight with me (that he heard from Dave Cheney, I think?) - "Is this what you want to spend your innovation tokens on?" (referring to building something in-house) - and the answer is probably... no. So this is how we got here!

In practical terms, we give you a "connector" (in the form of a Docker image) that hooks into your messaging system as a consumer and begins copying all data that it sees on a topic/exchange to Batch. Alternatively, you can pump data into our platform via a generic HTTP or gRPC API. Once the messages reach Batch, we index them and write them to a long-term store (we use https://www.elassandra.io). At that point, you can use either our UI or HTTP API to search and replay a subset of the messages to an HTTP destination or into another messaging system.
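
To make the collection flow concrete, here is a minimal sketch of what a connector might do per message. The envelope format here is hypothetical (Batch's actual wire format isn't public) - the point is just that each event gets a unique id and a timestamp on the way in, which is what later makes search and replay possible:

```python
import json
import time
import uuid

def wrap_event(raw_payload: bytes, source: str, topic: str) -> dict:
    """Wrap a raw message in a collection envelope (hypothetical format)."""
    return {
        "id": str(uuid.uuid4()),                 # unique id, referenced later in search/replay
        "ts_us": int(time.time() * 1_000_000),   # microsecond ingest timestamp
        "source": source,                        # e.g. "kafka", "rabbitmq", "gcp-pubsub"
        "topic": topic,                          # topic/exchange the event came from
        "payload": raw_payload.decode("utf-8", errors="replace"),
    }

# A consumer loop would accumulate these envelopes and POST them
# to the collector's HTTP (or gRPC) endpoint:
events = [wrap_event(b'{"user": "alice"}', "kafka", "logins")]
body = json.dumps(events)
```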

Right now, our platform is able to ingest data from Kafka, RabbitMQ and GCP PubSub, and we've got SQS on the roadmap. Really, we're cool with adding support for whatever messaging system you need as long as it solves a problem for you.

One super cool thing is that if you are encoding your events in protobuf, we are able to decode them upon arrival on our platform, so that we can index them and let you search for data within them. In fact, we think this functionality is so cool that we really wanted to share it - surely there are other folks that need to quickly read/write encoded data to various messaging systems. We wrote https://github.com/batchcorp/plumber for that purpose. It's like curl for messaging systems and currently supports Kafka, RabbitMQ and GCP PubSub. It's a port from an internal tool we used when interacting with our own Kafka and RabbitMQ instances.
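
The decode-before-indexing step can be sketched as a registry of decoders keyed by the topic's declared encoding. A JSON decoder stands in below for what Batch/plumber do with protobuf (real protobuf decoding needs the compiled message schema, as hinted in the comment); this is an illustration of the pipeline shape, not their actual implementation:

```python
import json
from typing import Callable, Dict

# Decoders keyed by encoding name. A protobuf entry would need the
# compiled schema, e.g. lambda raw: MyMessage.FromString(raw) (hypothetical).
DECODERS: Dict[str, Callable[[bytes], dict]] = {
    "json": lambda raw: json.loads(raw),
}

def decode_for_indexing(raw: bytes, encoding: str) -> dict:
    """Turn an opaque payload into a dict of fields the index can search."""
    decoder = DECODERS.get(encoding)
    if decoder is None:
        # Unknown encoding: store the raw bytes as an unsearchable blob.
        return {"_raw": raw.hex()}
    return decoder(raw)
```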

In closing, we would love for you to check out https://batch.sh and tell us what you think. Our initial thinking is to allow folks to pump their data into us for free with 1-3 days of retention. If you need more retention, that'll require $ (we're leaning towards a usage-based pricing model).

We envision Batch becoming a foundational component of your system architecture, but right now, our #1 goal is to lower the barrier to entry for event sourcing and we think that offering "out-of-the-box" replay functionality is the first step towards making this happen.

...And if event sourcing is not your cup of tea, you can still put us in your stack to gain visibility and peace of mind.

OK that's it! Thank you for checking us out!

~Dan & Ustin

P.S. Forgot about our creds:

I (Dan), spent a large chunk of my career working at data centers doing systems integration work. I got exposed to all kinds of esoteric things like how to integrate diesel generators into CMSs and automate VLAN provisioning for customers. I also learned that "move fast and break things" does not apply to data centers haha. After data centers, I went to work for New Relic, followed by InVision, Digital Ocean and most recently, Community (which is where I met Ustin). I work primarily in Go, consider myself a generalist, prefer light beers over IPAs and dabble in metal (music) production.

Ustin is a physicist turned computer scientist and worked towards a PhD on distributed storage over lossy networks. He has spent most of his career working as a founding engineer at startups like Community. He has a lot of experience working in Elixir and Go and working on large, complex systems.

I've worked on a very similar product in the past and can affirm that there is definitely enterprise interest for a good solution to event replay for orgs that are already doing event sourcing...I'm curious if offering out of the box replay will actually lower the bar and drive more orgs to pursue event sourcing? The CLI search functionality is really cool and useful as well.

Hey there!

Re: lowering the bar - we hope so. What we've noticed is that the papers that talk about event sourcing mention replays but don't talk at all about the implementation (or give any pointers). We're hoping that if at least that part is done for you, you've got one less thing to worry about.

As for the CLI tool - thanks! We found it super useful ourselves and figured others would too. I like to think of it as a sort of intelligent `netcat` for messaging systems :D

Congrats on the release. We've made a ragtag solution in-house that is complicated but works for those few unfortunate occasions when we need it. There's a demo on request, but it would be helpful if we had a better way to test the product. Maybe an endpoint where we stream, say, 10,000 events and see them replayed? What sort of pricing tier are we talking about?

Thank you!

Re: ragtag in-house solution that is complicated

^ That's exactly what we're talking about. These systems get complex pretty quick and you end up with duct tape in more than a few places.

As for demo - yeah, our plan is to open up registrations for accounts soon which will allow you to pump data into us for free with a low retention period.

We've still got some pieces to tighten before we can open the gates fully but we'll try to make it happen soon (within next few weeks?). In the meantime, if you want a demo, ping us and we'll make it happen.

Wow, this is a great idea. I recently worked on a team building streaming data pipelines, and we built a bespoke system to do exactly this: end-to-end test our software. We had past messages written to a >300TB sharded file, and wrote a microservice to read each shard and publish it to the message queue for our staging instance, and then run data validation/anomaly detection on the output. It was useful but incredibly painful to use and maintain, and Batch would have been a fantastic solution for accomplishing this.

Congrats, looks useful! Just an opinion, but I think you should skip the big animation on your homepage and just lead with "Our platform is essential in scaling and maintaining your business." I had no idea what Batch was until I scrolled way below the fold.

Ustin here.

It's definitely on our roadmap to fix. Thank you for the feedback!

Yeah... we've heard this before. But the wavy stuff makes me feel so ... _caaaaalm_ :)

It is nice and calming. Btw, the Twitter link at the bottom of your site is broken (it links to https://batch.sh/www.twitter.com/batchsh).


How does bookmarking work/How do I keep track of how far I've read while replaying from Batch? Will you also index by date? It can take a long time to replay a lot of data; do you have any numbers on the read rates you support per topic?

Great questions!

> How does bookmarking work/How do I keep track of how far I've read while replaying from Batch?

We do not have any bookmarking functionality built (yet) - we currently expect folks to just tweak their search query. Each event has a unique id attached to it that you can reference in your search queries.
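
The "tweak your search query" approach amounts to cursor-style resumption: remember the timestamp (or id) of the last event you replayed and query for everything newer. A minimal sketch, with hypothetical event fields:

```python
def replay(events, deliver, cursor=0):
    """Replay events newer than `cursor` (a microsecond timestamp) and
    return the new cursor, so a later run can resume where this one
    stopped. Illustrative only; the real search API may differ."""
    for ev in sorted(events, key=lambda e: e["ts_us"]):
        if ev["ts_us"] <= cursor:
            continue  # already replayed in an earlier run
        deliver(ev)
        cursor = ev["ts_us"]
    return cursor
```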

> Will you also index by date?

We do! Every event has a microsecond timestamp attached to it.

> It can take a long time to replay a lot of data; do you have any numbers on the read rates you support per topic?

We've done some initial replay throughput tests and have been able to reach ~10k/s outbound via HTTP - of course, this is all _highly_ dependent on where you're located. We expect that for folks who need super high throughput, we'll probably need to be closer to them - we fully expect to have to peer with some of our customers and optimize for throughput by doing gRPC and ... batching :)

So far, we've done most of our testing on inbound and we are currently able to sustain ~50k/s (with ~5KB event size). Our inbound is able to scale horizontally and so can go waaaaay beyond 50k/s if needed.
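
The "batching" optimization mentioned above is the standard trick for pushing throughput: group events so the per-request overhead is amortized across many messages. A minimal sketch (not Batch's actual implementation):

```python
class Batcher:
    """Accumulate events and flush them in groups via the supplied
    `flush` callable, which would wrap the HTTP/gRPC send."""
    def __init__(self, flush, max_batch=100):
        self.flush = flush
        self.max_batch = max_batch
        self.buf = []

    def add(self, event):
        self.buf.append(event)
        if len(self.buf) >= self.max_batch:
            self.drain()

    def drain(self):
        # Flush whatever is buffered (called on size limit and at shutdown;
        # a production version would also flush on a timer).
        if self.buf:
            self.flush(self.buf)
            self.buf = []
```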

We have a ton of service instrumentation so we've got good visibility around throughput (and thus should know well in advance as to when we're starting to hit limits).

Congrats on the launch!

Two questions:

- If I have some data in Kafka, why would I want to pump it into your platform instead of spawning an Elasticsearch instance and using something like Kafka Connect to write to it and gain visibility?

- If I use Kafka as a permanent data store (with infinite retention), I can easily replay all events with existing clients (or with plumber). What additional functionality does the "replay" feature offer compared to that?

Hey there!

> - If I have some data in Kafka, why would I want to pump it into your platform instead of spawning an Elasticsearch instance and using something like Kafka Connect to write to it and gain visibility?

To avoid having to build, own, and maintain the infra you just mentioned. As the number of events in your system increases, you will have to scale ES and the other pieces of the system as well.

Our point is just that - if you know what's involved in collecting and indexing the events - that is awesome but maybe you shouldn't have to spend time building the infra around that stuff.

> If I use Kafka as a permanent data store (with infinite retention), I can easily replay all events with existing clients (or with plumber). What additional functionality does the "replay" feature offer compared to that?

I think it depends on your definition of "easily replay" - a Kafka replay for a topic that's being consumed by a consumer group would require you to disconnect that consumer group and then run a shell script to move the offsets. You also would not have any way to replay specific messages - your only point of reference would be an offset (and key name, if you use one) - not terribly flexible.

With Batch, you get to drill in and replay the _exact_ messages you want (and avoid having to pump and dump potentially millions of messages your consumer doesn't care about).
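
The difference between the two replay styles can be shown in a few lines - an offset reset gives you a contiguous suffix of the log (wanted messages and all), while a search-driven replay selects only the messages that match a predicate (a toy model, not either system's actual mechanics):

```python
def offset_range_replay(log, start_offset):
    """Kafka-style replay: everything from an offset onward, wanted or not."""
    return log[start_offset:]

def targeted_replay(log, predicate):
    """Search-style replay: only the messages matching a predicate."""
    return [msg for msg in log if predicate(msg)]
```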

Makes sense, thanks for the clarification!

I hope for the founders this ages like the Dropbox comment.

If I'm writing all my messages to durable storage why not work off the durable storage? I'm definitely not an expert in this area so perhaps I'm missing something. My logic is that if you're paying the resource cost to write all your messages why not pay the resource cost to read/write back there?

I think you're asking why we don't just become the hosted Kafka/RabbitMQ/etc. and offer all of this in one place (let me know if that's wrong).

That's a totally legit point - we've talked about offering it all in-house before but it would require us to split our efforts into two - operating a PaaS (for a bunch of different messaging tech) and running the event collection platform.

Operating the PaaS part would be a full-time effort and there's a lot of competition out there. We've decided to focus on the observability/replay part first (since there is a lot less competition) and then later maybe explore the hosted bus option.

LMK if that's not what you meant :)

>I think you're asking why don't we just be the hosted kafka/rabbitmq/etc and offer all of this stuff in one place.

The other way around. If I'm not storing my messages today it's probably because it is too expensive in terms of storage or compute to do so. But, presumably, you can't do that any cheaper than I can. And now we are duplicating the work so even more resources are being consumed making it that much more expensive than just doing it myself.

It seems like your service is something I'd want to run pointed towards my Kafka/RabbitMQ/whatever servers. I don't see how duplicating that stream is cost effective.

Ahh gotcha. If you need event introspection, doing it in-house is extremely likely to be more expensive (and definitely time consuming) than offloading it.

For example: if you are sending serialized data on your bus, you will need to write something that deserializes it before inserting it into your Elasticsearch cluster - and now you're managing even more infra (messaging systems, decoders, document storage).

There is definitely a price attached to the luxury - but we're betting that it'll be significantly less than doing it yourself.

This looks interesting! A couple questions (that may also apply to event sourcing more generally):

- How do you handle events with side effects (sending emails, for example), and ensuring they aren't triggered on replay when they shouldn't be?

- How do you handle randomness, like uuid generation?

> How do you handle events with side effects (sending emails, for example), and ensuring they aren't triggered on replay when they shouldn't be?

Someone else already addressed this, but to paraphrase: your application should be able to deal with duplicate events (and gracefully handle side-effects).

> How do you handle randomness, like uuid generation?

Are you referring to id generation and tagging in events (i.e. aggregate ids)? If so, that'd be an application responsibility - you'd have to determine how to properly attach ids.

Hmm. But that does bring up an interesting idea - what if we provided a way to "group" events and generate aggregate ids on your behalf? Maybe that's what you meant.

We currently don't do anything "extra" in regards to grouping events - we tag each individual event but that's about it.

Event systems can guarantee at-least-once delivery at best, so your application needs to be able to handle duplicate events in any case via idempotency.
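
The usual pattern for this is an idempotent consumer: deduplicate on event id before triggering any side effect, so duplicates and replays are harmless. A minimal sketch (in production the seen-id set would live in a durable store, not in memory):

```python
class IdempotentConsumer:
    """Consumer that tolerates at-least-once delivery (and replays)
    by deduplicating on event id before running side effects."""
    def __init__(self, side_effect):
        self.side_effect = side_effect  # e.g. send an email
        self.processed = set()          # ids we've already handled

    def handle(self, event):
        if event["id"] in self.processed:
            return False  # duplicate or replayed event: skip the side effect
        self.side_effect(event)
        self.processed.add(event["id"])
        return True
```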

I feel like a tagline like "event sourcing made easy" would hook me more and get me interested in _attempting_ to decipher your marketing page to understand the USP.

Pretty cool idea though. Hope it pans out for you guys.

Understood. Yeah, the issue here is that the entire space is pretty complex - it feels like any angle you approach it from, it'll still be complex.

Will try to figure out a way to better communicate what we do.

As you're brainstorming, try to imagine what you would shout in a loud bar (remember those?) if someone asked you what your company did. It can be a helpful mental tool to strip things down to the essentials.

Two-sentence pitches are much harder than two-minute pitches.

Well then perhaps not "made easy" at least made _easier_!

One of my previous co. used Kafka and hacked something similar together on an internal Retool dashboard + DynamoDB. This definitely makes a lot of sense!

Will this work with Celery (python) configured with RabbitMQ as the broker?

It will work on any Rabbit queue as long as you are not using the default exchange for the queue.

Hi Dan/Ustin,

Congrats on the launch. The pain-point makes sense to me. I'm just curious - what's the big picture for you all? I imagine it must be larger than just replay.

Batch is betting that more companies are going to utilize event sourcing in order to scale. We want to be a foundational piece of their data infrastructure and support their transition into event sourcing by initially offering replays. We want to be a "one-stop shop" for all event sourcing needs.

Cool! I don't have much data on how many companies are using events for key workflows, but I do know that many, many companies would _love_ to replay HTTP requests!

That's awesome! We support HTTP and gRPC collection as well. Let us know what you have in mind.

It seems like you're solving quite a complex problem!

I'm curious how long it took you to build this initial product given the complexity.

YC has a bias for shipping quickly, but my gut instinct is that it would have taken you a while to build this initial version.

Did it only take a few months, or closer to 8-12+?

Ha, great observation!

Yes and no :)

We've been exceptionally lucky to have several of our close friends help us out with building an MVP (it also helps that our friends have serious experience!). There are six of us in total - three people focusing on infra, frontend, and the Java connector bits, which allowed Ustin, another dev, and me to put 100% of our attention on backend services + architecture.

That enabled us to knock this out in a few months. Without the assist, it would probably be closer to your estimate.

Something that may be of interest to some folks: we saved a significant amount of time by not having to run our own k8s - we use EKS, it's very nice. Also, MSK - not having to run/manage ZK clusters and Kafka nodes is a (costly) privilege haha

Nice, I'm jealous of the extra help you're getting haha (I'm trying to build something solo). It sounds like you're already making a lot of progress.

Good luck building out the rest of the product!

I've solved the replaying bit before with a brute approach and AWS Athena. It would ingest all the events from S3, filter the unwanted ones out, and put the rest in SQS ready for consumption. It was definitely expensive though, not something you would run often.

This is definitely a valid approach, and it is an expensive path - Athena charges per amount of data scanned, which adds up quickly on a large dataset. In addition, this approach won't work if your data is serialized with something like protobuf.

Very interesting tool. You're absolutely right that I wouldn't spend my innovation tokens on it. Congratulations for your work !

Does this have support for Rabbit pub/sub? There's a bit of confusing wording on the page that makes it unclear.

100% - we use Rabbit internally for our own systems so it has first-class support.

I think maybe we should just list out the messaging systems we support on the front page, so you don't have to dig through stuff... Good point. Let me know if you've got any other suggestions.

I hear so much about Kafka, could someone give the two-sentence description of what it is and who uses it and for what?

From their website: Kafka is an open-source distributed event streaming platform.

There are many use cases, from website activity tracking and metrics to log aggregation and stream processing. For us, it's a communication layer utilized by our microservices: an event goes into the stream, and any service that cares about that data will consume it. In other words, it's like an ultra-resilient, scalable Redis pub/sub with history that runs on the JVM. You can read more about the use cases here: https://kafka.apache.org/uses

edit: Sidenote - Kafka is often waaaaaay overkill. If you need messaging, use something simpler like Rabbit, NATS, or Redis, and only use Kafka if you know why you need it.
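
The "pub/sub with history" idea can be modeled in a few lines: an append-only log plus a per-consumer read offset, so each service consumes at its own pace and new services can start from the beginning. This is a toy sketch of the abstraction, not Kafka's actual implementation:

```python
class Log:
    """Toy model of Kafka's core abstraction: an append-only log read
    by multiple consumers, each tracking its own offset."""
    def __init__(self):
        self.entries = []
        self.offsets = {}  # consumer name -> next index to read

    def publish(self, event):
        self.entries.append(event)

    def poll(self, consumer):
        # Return everything this consumer hasn't seen yet, then advance
        # its offset. Other consumers are unaffected.
        start = self.offsets.get(consumer, 0)
        batch = self.entries[start:]
        self.offsets[consumer] = len(self.entries)
        return batch
```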

Thanks. So an event that goes in would be something like, “user logged in,” and services that care about that data would be...? Sorry still having some trouble understanding it.

Pretty much. A good example might be an online store. Let's say one of your internal services deals with notifying UPS, FedEx, or DHL to pick up a package from your warehouse and deliver it to a customer. You could use something like Kafka to store messages about deliveries, which your internal service will pick up, process, and then use to notify the delivery company's API.

Something like Batch could be helpful in this situation. For example, let's say a dev ships a deploy that breaks only the FedEx delivery notification, or the FedEx API breaks in a way you were not expecting. Once the issue is fixed on your side or FedEx's side, you could use Batch to search for all FedEx deliveries that were handled improperly during the time frame of the issue. This way you are not blindly resending messages to all your delivery companies for an issue that affected only one vendor.

makes sense, thanks. How would this be better than the logs you'd get from each service?

You might have an achievement service. It watches for logins and grants a user an achievement after logging in for N consecutive days. Your authentication code need not know that an achievement service exists.
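
As a concrete sketch of that achievement service (all names hypothetical), the streak logic can be derived purely from login events, with the authentication code none the wiser:

```python
from datetime import date, timedelta

def earns_streak_achievement(login_dates, streak=7):
    """Return True if the login dates contain `streak` consecutive days.
    An achievement service would feed this from consumed login events."""
    days = sorted(set(login_dates))  # dedupe multiple logins per day
    run = 1
    for prev, cur in zip(days, days[1:]):
        run = run + 1 if cur - prev == timedelta(days=1) else 1
        if run >= streak:
            return True
    return streak == 1 and bool(days)
```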

Hmm this is very clear! Thanks.

Is Pulsar support on the roadmap?

We are planning to support as many messaging systems as we can. We will definitely investigate Pulsar. Going to add it to our feature list and make an issue on plumber to support introspection on Pulsar. Cheers!

Good to know that Pulsar is on your roadmap. Also, kudos for building user-land tooling around a common pain point for teams doing any event processing at scale.

Thank you! We felt the pain point while actively trying to build observability tools in order to debug our messaging systems. We built plumber to standardize some of our internal tools and then decided to open source it to help others who are feeling the pain.

Nice concept and interesting, expect a demo request incoming :)

Congratulations on this release - that is really useful!

Thank you very much!

'Light beers over IPA' sorry, I'm out.


Has anyone ever told you that you're "batch it crazy"?

No, but this is now definitely going on a sticker :D
