Hacker News
A dull, resilient stream processor in Go (github.com/jeffail)
199 points by ngaut on Aug 16, 2018 | hide | past | favorite | 30 comments



Hey everyone, author here. We use Benthos as a general stream swiss army knife for all the dull tasks, but you can also use it as a framework for writing your own stream processors in Go.

It doesn't provide any tools for idempotent calculations yet, just at-least-once delivery guarantees. But when you use it as a framework you get the benefit of all the configurable processors in your new service.


"all the dull tasks"

Now that is how you advertise at me. I've seen more "minimal flexible generic configurable simple performant powerful easy extensible non-bloated" libraries than I can shake a stick at. But "this helps with the dull stuff"... that's a sales pitch I can get behind!

(A constructive hint to those describing their libraries: At this point, use of pretty much any of those words in your summary sentence is a massive negative to me. They are of null content, and you just burned the most valuable marketing space you had on these null-content words. Everybody says that about their libraries. Tell me what it does in the summary. If you really want to make the point for one of those words, do it later in the documentation, because if your summary sounds like you copy/pasted my quoted bit above, I'm not even going to read your extended documentation. And when you do, provide evidence; if you're "performant", show me competitive benchmarks. I get that they're going to be biased, but it at least shows me what's important. If you're "simple" show me in the sample code. Etc.)


Add "elegant" onto that list and I think you've hit the nail on the head.

These words aren't descriptors, they don't explain anything or give any further context about the nature of a library.


But is it "for humans?"


Not the OP or author, but for anyone interested, this video expands on the implementation of Benthos: https://www.youtube.com/watch?v=NM7X4PIUQB0

As someone who is getting into Golang quite a bit I found it very interesting to peer into the thought process that led to Benthos and how it was/is implemented.


After reading the section about processors it is not clear whether it can do stateful computations, for example, running sum or moving average. Can I define sliding windows? Can I join two or more streams?


All Benthos processors are stateless, except for a few such as batch, combine, etc, which do basic tasks such as building up batches from consecutive messages. It's possible to create sliding window processors on top of Benthos, but there aren't any general tools for doing that yet.


For stateful processing you need something like:

* https://kafka.apache.org/documentation/streams/ - Kafka Streams (and its KTable)

* https://flink.apache.org/ - Flink

* https://spark.apache.org/streaming/ - Spark Streaming

* https://github.com/asavinov/bistro - Bistro Streams

The focus of Benthos seems to be performance and resilience (implemented without persistence).


Or to toot the horn of the project I work on:

* https://github.com/wallaroolabs/wallaroo - Wallaroo


Interesting! We use NSQ at work to collect logs/messages from various systems and forward them to various endpoints (Kafka, ELK, archives, Prometheus). This looks like a more advanced/flexible system, whereas we just use a handful of single-purpose daemons written in Go (nsq2kafka, nsq2es, nsqarchive, nsqstream-metrics).

I gave a quick intro to nsq talk at our local go meetup recently. If you haven't used NSQ before I highly recommend giving it a try.

https://docs.google.com/presentation/d/1e9yIm-0aNba_H1gX_u7D...


Talk about timing!

I'm in the process of rejigging our live telemetry pipeline (dealing with petabytes) to improve flexibility and resiliency and have been looking for something to replace our similar systems that have grown organically. This fits the bill exactly; it's designed exactly how I imagined our new system should look.

I'm wondering if HN has their personal favorite for these sorts of problems.


We've been using Logstash as our go-to stream-processing glue; its large selection of input, output, and filter plugins lets it handle just about anything, reliably. My one complaint would be the use of the JVM, particularly the slow startup time it causes during development.


Supported Sources & Sinks: Amazon (S3, SQS), Elasticsearch (output only), File, HTTP(S), Kafka, MQTT, Nanomsg, NATS, NATS Streaming, NSQ, RabbitMQ (AMQP 0.9.1), Redis, Stdin/Stdout, Websocket, ZMQ4.


I must be missing something. It doesn't look dull.


Looks like this could be useful for webhooks delivery, a problem our team is working on at the moment.

We have hundreds of tenants, each of which could get their own stream with some delivery guarantees set. If one tenant's endpoint is down or another tenant fills the stream with 1M messages, it should not affect delivery rates of other tenants. Seems to fit the bill.

I see the HTTP output mentions some retries, but I guess these run as part of the delivery step, blocking the goroutine rather than rescheduling the message? Sometimes it takes hours for clients to restore their receiving systems, and it would be great if messages from the past N hours would still be delivered.


I'd strongly suggest using Elixir for this but then I'm a bit of a fan boy ;-)

https://phoenixframework.org/blog/the-road-to-2-million-webs...


Hey, the HTTP output has a fixed number of retries, after which you can either fall back on some mechanism of your own or, by default, it will simply continue retrying while blocking upstream. You might also be interested in running Benthos in streams mode: https://github.com/Jeffail/benthos/tree/master/docs/streams

Streams mode lets you run as many isolated stream pipelines as you want in the same process, which in your case could be a simple queue -> webhook bridge. You can manage these pipelines either statically in config files, or dynamically through a REST API.


That seems really promising, especially configuring pipelines with API calls. Many wheels may be left uninvented.

Thank you for making your work available!


This seems pretty similar to Wallaroo but with more focus on connecting to everything. And of course golang vs Pony


Hey, Benthos doesn't yet provide any general tooling for exactly-once processing like Wallaroo does, that's possibly a goal for the future.

My main focus has been providing general-purpose stateless processors. So you can build your own stream processor focusing on what makes it unique, and the moment it's compiled and packaged it can read and write to anything, and convert any kind of payload to any other, just through configuration.


Hi Wallaroo developer here,

I'm curious, does Benthos manage state for the user/application like we do in Wallaroo or is it purely stateless computations at this point?


Processor implementations are able to carry their own state, or share state across worker threads or deployments using their own mechanisms, but Benthos doesn't provide any tooling for that. None of the processors you get out of the box need computational state. You currently get an at-least-once (ALO) stream (provided you use ALO protocols), vertical & horizontal scaling as per your config, and any glue you need between services.

It would be a nice stretch goal to have standard tooling within Benthos to share distributed state, perhaps with some ability to do exactly-once processing, but that's not the focus of the project right now.


That is a very interesting prospect actually, any plans on making python integration simple? Some parts of my pipelines rely on very large python modules I don't see being rebuilt in anything else in the next few years.


Is "dull" a technical term or does it literally just mean boring here?


Yeah just boring. Benthos is mostly a collection of standard processors for doing boring stuff.


In what situations would a tool like this be useful? Asking honestly. Thanks.


This seems quite a bit like Heka. Very excited.


Super exciting; I can imagine using this instead of Kafka for simpler use cases. I wonder if they will add a full logging system so we can drop Kafka completely.


That doesn’t seem to fit with the project goal; it’s not Kafka-centric. It’s more like an alternative to Kafka Connect in that regard, IMO.


Did you think of using the Go Apache Beam API? Any comparisons with OSS alternatives?



