Faust: Stream Processing for Python (github.com)
205 points by ericliuche 76 days ago | 59 comments



> Faust is extremely easy to use [...] Faust only requires Kafka, the rest is just Python

Kafka is still a pretty chunky dependency.

Have you thought about getting this to work with the new streams stuff in Redis? https://brandur.org/redis-streams

Redis is SO MUCH less hassle to get running (in both dev and production) than Kafka.
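
For anyone curious what the raw Redis Streams API looks like, here's a minimal sketch using redis-py 3.x (stream and field names are made up):

```python
import redis

r = redis.Redis()

# Producer: append an entry to the stream (XADD).
r.xadd('events', {'user': 'alice', 'action': 'login'})

# Consumer: start from the beginning, block up to 5s for new entries (XREAD).
last_id = '0'
while True:
    for stream_name, entries in r.xread({'events': last_id}, block=5000):
        for entry_id, fields in entries:
            print(entry_id, fields)
            last_id = entry_id
```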


My name is Ask and I'm the co-creator of this project, along with Vineet.

Yes, we're definitely interested in supporting Redis Streams!

Faust is designed to support different brokers, but Kafka is the only implementation as of now.


Hey, I am a huge user of Apache Flink in personal and work projects. Skimmed through your readme, but still have the following question.

What are the advantages of using Faust, feature-wise or otherwise? Are you guys planning on having feature parity with Flink (triggers, evictors, process function equivalent, etc.)? I can definitely see its use in having better compatibility with the ML/AI environment and the other tools in the Python toolkit. But specifically regarding ML/AI, I can usually export those things into the JVM. And with regard to Flask/Django, I can use Scala's http4s.

Sorry if that seemed like a lot! Trust me, I'm very excited about this project!


You are the first Flink user I have come across! Could you briefly explain why it is so compelling over its competitors?


When it comes to batch, I still think Apache Spark and the standard scientific computing stack in Python are king.

But when it comes to streaming, I don't think Spark Structured Streaming is very close, although Spark 2.3.0 might be sufficient for your use cases; I use Spark Structured Streaming only when I have to deploy a Spark pipeline/ML model.

But, besides the features I listed, I think Flink is just better written for streaming.

For example, defining checkpoints and state management in Flink is so much more expressive and easier to write (and perhaps more performant from what I saw at Flink Forward).

Flink Async is amazing!!!!

The Flink table/sql API is not as nice as Spark SQL for batch right now, but it is getting there.

Flink also seems to perform better than Spark Structured Streaming when you turn on object reusability, at least according to Alibaba and Netflix.

And then, the features I listed are amazing. Evictors and triggers are super helpful. Process Functions let me write custom aggregations without a lot of mental overhead.

Time Characteristics are really nice in Flink.

Flink has a lot of nice connectors and stuff written already (although Parquet files are troublesome if they come from Spark, because of the way Spark handles its read and write support classes). For example, you have to write your own Kinesis connector in Spark Structured Streaming (or use the one put out by Databricks on their blog).

Sorry if this seems all over the place. I just posted what came to mind. And I should add, Databricks is doing a lot of work to make Spark Structured Streaming really comparable, so who knows what stuff looks like in 2-3 years. Like I said above, I use Spark Structured Streaming (2.3.0+) for some very specific ML stuff that my Flink environment cannot handle nicely yet.

My only complaint is that there isn't a lot of community written code out there, but the Data Artisans team is super active, helpful, and nice.


Thanks, this is super helpful and exactly what I am looking for. My main prospective use of Flink is as a target for Apache Beam: GCP Dataflow is great, but I want to have portable jobs, and Flink looks like the best target (over Spark).


Never used Beam before, and I am unsure if it exposes all the features of Flink. The documentation also seems a bit sparse (I couldn't find how to tune checkpoints, for example, on a quick glance). But if you know how to tune it to use more of Flink, or it works for your use case (it supports Python and Go), then go for it. It looks pretty neat.


I'd love to see some docs/examples around supporting pluggable brokers; Faust looks like exactly what I want from a Python stream processor, but for various technical and operational reasons it would be nice to be able to provide a RabbitMQ/AMQP broker.


We added two levels of abstraction for this purpose:

- A Stream iterates over a channel

- A Channel implements: `channel.send` and `channel.__aiter__`.

- A topic is a "named" channel backed by a Kafka topic

- Further, the topic is backed by a Transport

- The Transport is very Kafka-specific

To implement support for AMQP, the idea is that you only need to implement a custom channel.
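
To make that concrete, here is a toy sketch of the channel interface shape (not real Faust code; it uses an in-memory asyncio.Queue where an AMQP implementation would talk to RabbitMQ, and the real Channel base class has more to it):

```python
import asyncio

class ToyChannel:
    """Illustrative only: the two methods a Faust-style channel exposes."""

    def __init__(self):
        self._queue = asyncio.Queue()

    async def send(self, value):
        # An AMQP-backed channel would publish to an exchange here.
        await self._queue.put(value)

    def __aiter__(self):
        return self

    async def __anext__(self):
        # An AMQP-backed channel would pull the next delivery here.
        return await self._queue.get()
```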

If you open an issue we can consider how to best implement it.


Well I love Celery, so I'm even more excited to learn more about Faust now :)


I'm looking at Redis Streams and it seems to lack the partitioning component of Kafka. One of the properties we rely on is the ability to manually assign partitions, such that a worker assigned partition 0 of one topic will also be assigned partition 0 of the other topics it consumes from.
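
For reference, this is roughly what that manual, co-partitioned assignment looks like with the kafka-python client (topic names are made up; Faust manages this internally):

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')

# Pin this worker to partition 0 of both topics, so the same keys
# from each topic are always processed together on this worker.
consumer.assign([
    TopicPartition('orders', 0),
    TopicPartition('payments', 0),
])

for message in consumer:
    print(message.topic, message.partition, message.value)
```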

Simplicity is of course a goal, but this may mean we have to sacrifice some features when Redis Streams is used as a backend.

I'm glad you like Celery; this project in many ways realizes what I wanted Celery to be.


If you liked Celery 3.x but don't like all the bugs in Celery 4.x, check out these alternative Python task queues:

https://github.com/closeio/tasktiger (supports subqueues to prevent users bottlenecking each other)

https://github.com/Bogdanp/dramatiq (high message throughput)


In Faust we've had the ability to take a very different approach towards quality control. We have a cloud integration testing setup that actually runs a series of Faust apps in production. We do chaos testing: randomly terminating workers, randomly blocking the network to Kafka, and much more, all the while monitoring the health of the apps and the consistency of the results they produce.

If you have a Faust app that depends on a particular feature, we strongly suggest you submit it as an integration test for us to run.

Hopefully some day Celery will be able to take the same approach, but running cloud servers costs money that the project does not have.


Any interest in Pulsar support?


Or even better: using SSDB, a drop-in replacement for Redis that scales to larger data sets.

https://github.com/ideawu/ssdb


I don't see any discussion of idempotent processing or the Kafka transactions feature in the documentation, and I also see a mention of acking unprocessed messages, in the form of acking when an exception is thrown in the application.

These seem like pretty important details to acknowledge up front.

Are there plans to support idempotent processing, or is this already implemented and I just missed it?

The developer-facing API looks very nice!


It's hard to say, but Kafka provides a number of important primitives for idempotent processing out of the box for consumers and producers. Given that stream-style apps are just a series of consumers and producers linked together, there is good reason to believe this could offer idempotent processing.


Hi, thanks for the comments. My name is Vineet and I am a co-creator of this project along with Ask.

We do plan on adding support for Kafka exactly-once semantics to Faust. As @dleclere pointed out, streaming apps are simply a topology of consumers and producers. We will add this as soon as the underlying Kafka clients add support for exactly-once semantics.


"Faust" is a name-collision with another piece of software that does (audio) stream processing: http://faust.grame.fr/.


It's also a name collision with a whole bunch of other things. Pretty soon we'll need some namespaces to help distinguish these things, or maybe just context.


> Pretty soon we'll need some namespaces to help distinguish these things, or maybe just context

Quite the /faustian/ bargain you propose...


What is the tradeoff/downside to namespaces?

Other than their being cumbersome, but that is upfront, so I don't see the Faustian-ness there.


The better the comedic payoff, the flimsier the alignment with reality needs to be.

As a corollary though - the stronger the alignment with reality, the better the comedic payoff - all other things being equal.


Well, that's about the only one related to stream programming.

It is called Functional Audio Stream for a reason.


How does Faust work with Django? There's a mention of needing gevent or eventlet to bridge, but nothing about how you'd integrate the two.

I'm assuming the entrypoint would need a django.setup(), and then you'd import the modules that have the Faust tasks? A how-to in the docs would be very useful.
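
Something like this is what I have in mind (module and model names are made up; django.setup() has to run before anything touches the ORM):

```python
import os

import django

# Configure Django before importing anything that touches the ORM.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
django.setup()

import faust

from myapp.models import Order  # hypothetical Django model

app = faust.App('myproject-streams', broker='kafka://localhost:9092')
orders_topic = app.topic('orders')

@app.agent(orders_topic)
async def process_orders(orders):
    async for order in orders:
        # ORM calls are blocking, so keep them small or offload them.
        print(order, Order.objects.count())
```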

I see in one of the other comments something about acking due to an exception. This is one of the major issues with Celery IMO (at least before version 4, when the new setting was added). I'd hope that exception acking would at least be configurable.


There's an examples/django project in the distribution. I think they removed the gevent bridge from PyPI for some reason, but you can still use the eventlet one. Gevent is production quality and the concept of bridging them is sound, so I hope someone will work on it.


The docs link doesn't seem to be live yet:

http://docs.fauststream.com/en/latest/introduction.html

One question I had is how Robinhood might be doing message passing and persistence. For example, I might have a stream that creates more messages that need to be processed, and then eventually I would want to persist the "Table" to an actual datastore.


My name is Ask and I am one of the co-creators along with Vineet.

Thanks for pointing this out. Fixed the links. You can also find the docs here: http://faust.readthedocs.io/en/latest/

Faust uses Kafka for message passing. The new messages you create can be pushed to a new topic and you could have another agent consuming from this new topic. Check out the word count example here: https://github.com/robinhood/faust/blob/9fc9af9f213b75159a54...
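
The gist of that example looks roughly like this (paraphrasing from memory, with hypothetical topic names):

```python
import faust

app = faust.App('word-counts', broker='kafka://localhost:9092')

words_topic = app.topic('words', value_type=str)
counted_topic = app.topic('words-counted', value_type=str)
word_counts = app.Table('word_counts', default=int)

@app.agent(words_topic)
async def count_words(words):
    async for word in words:
        word_counts[word] += 1
        # Push downstream work to another topic for a second agent to consume.
        await counted_topic.send(value=word)
```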

Also note that the Table is persisted in a log-compacted Kafka topic. This means we are able to recover the state of the table in the case of a failure. However, you can always write to any other datastore while processing a stream within an agent. We do have some services that process streams and store state in Redis and Postgres.


What I'm thinking of is flushing the table to a secondary storage so that other services can query that data.

I think Storm/Flink have the concept of a "tick tuple": a message that comes every n seconds to tell the worker to flush the table to some other store. I've been looking over the project and I'm not sure how I would do this in Faust yet; as far as I understand, the "Table" is partitioned, so you'd have to send a tick to every worker.

Very interesting project!


Faust can serve the data over HTTP/websockets and other transports, so you can query it directly!
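
For example, something like this (a sketch, reusing the word_counts table idea from above; note each worker only holds the table partitions it owns):

```python
import faust

app = faust.App('word-counts', broker='kafka://localhost:9092')
word_counts = app.Table('word_counts', default=int)

@app.page('/counts/')
async def get_counts(web, request):
    # Serve this worker's current view of the table as JSON.
    return web.json(dict(word_counts))
```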


In addition to that, you can have an arbitrary method run at an interval using the app.timer decorator. You can use this to flush every n seconds. You could also use stream.take to read batches of messages (or wait t seconds, whichever happens first) and write to an external source.
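
A sketch of the timer approach (write_to_external_store is a hypothetical helper):

```python
import faust

app = faust.App('flusher', broker='kafka://localhost:9092')
word_counts = app.Table('word_counts', default=int)

async def write_to_external_store(snapshot):
    ...  # hypothetical: write to Postgres, Redis, etc.

@app.timer(interval=10.0)
async def flush_counts():
    # Runs on every worker; each worker flushes the partitions it owns.
    await write_to_external_store(dict(word_counts))
```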


How does this compare with Celery, given that the author of both libraries is the same?

Is it an apples and oranges comparison? When would one use this instead of Celery?

Found the following for those curious: http://faust.readthedocs.io/en/latest/playbooks/vscelery.htm...



I guess the libraries ultimately serve different purposes, but the Celery example is 100 times more approachable than the Faust example.


Yes, they do serve different purposes but they also share similarities. You could easily write a task queue on top of Faust.

It's important to remember that users had difficulty understanding the concepts behind Celery as well; perhaps it's more approachable now that you're used to it.

Using an asynchronous iterator for processing events enables us to maintain state. It's no longer just a callback that handles a single event; you could do things like "read 10 events at a time", or "wait for events on two separate topics and join them".
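
For example, batching with stream.take looks something like this (topic name is made up):

```python
import faust

app = faust.App('batcher', broker='kafka://localhost:9092')
events_topic = app.topic('events', value_type=str)

@app.agent(events_topic)
async def process_batches(stream):
    # Up to 10 events at a time, or whatever arrived within 1 second,
    # whichever happens first.
    async for batch in stream.take(10, within=1.0):
        print(f'got a batch of {len(batch)} events')
```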


I was kinda wondering about the performance of a streaming lib written in Python, but from the README it looks like they support high-performance dependencies like uvloop, RocksDB and C extensions. And it's used in production at Robinhood. This could have been written in Go, but it was written in Python. Very interesting.


In my experience, if performance is an absolute need, there are plenty of other frameworks out there. Apache Flink and Apache Apex come to mind, even though they are JVM-based.

What's interesting about the Python angle is that I've noticed there are data science teams who want to manage streams in production, but there tends to be a heavy dependency on Python and its ecosystem (numpy, scipy, tensorflow).


Looks similar to the project my team uses for Kafka / Kinesis stream processing in Python / Django projects: https://gitlab.com/thelabnyc/django-logpipe



Seems to be working for me: http://faust.readthedocs.io/en/latest/


The link is wrong... make a PR!


Well, this is a bit off-topic, but all this naming chaos has gone way too far: Kafka, Faust, and so on. I find it really disturbing the way we treat and abuse such meaningful heritage.


I wouldn't worry about it too much. The vast difference in contexts (literary circles and high school classrooms vs. HN comment pages, Slack chatrooms, and developer podcasts) ensures there isn't any confusion. If anything, the choice of name speaks to the importance of its source.


> Tables are stored locally on each machine using a superfast embedded database written in C++, called RocksDB.

I haven't used RocksDB. Does being a LevelDB fork give it similar corruption issues?


We have been using RocksDB for some time and it works fairly well. While we haven't seen any corruption yet, as you correctly pointed out, it is definitely a possibility.

We use RocksDB only as a cache, and use log-compacted Kafka topics as the source of truth. In the case of RocksDB corruption, we can simply delete the local RocksDB cache, and the Faust app should recover (and rebuild the RocksDB state) from the log-compacted Kafka topic.


How swappable is the underlying DB, from RocksDB to something else? Faust seems awesome, but I imagine in many corporate settings having multiple DB vendors in production can be difficult.


Faust looks really cool with its native Python implementation. How does this compare with Apache Beam or Google's Dataflow, now that they have recently rolled out their Python SDK?


Faust is a library that you can import into your Python program, and all it requires is Kafka. Most other stream processing systems require additional infrastructure. Kafka Streams has similar goals, but Faust additionally enables you to use Python libraries and perform async I/O operations while processing the stream.
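
For example, doing async I/O mid-stream looks something like this (a sketch; the topic name and lookup URL are made up):

```python
import aiohttp
import faust

app = faust.App('enricher', broker='kafka://localhost:9092')
clicks_topic = app.topic('clicks', value_type=str)

@app.agent(clicks_topic)
async def enrich(clicks):
    async with aiohttp.ClientSession() as session:
        async for click in clicks:
            # Non-blocking HTTP call in the middle of stream processing.
            async with session.get(f'https://api.example.com/users/{click}') as resp:
                print(click, await resp.json())
```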


Is there something like this for node.js/typescript?



Do you have any thoughts on creating Models from Avro schemas?


We had support for it initially, but we ended up not using the schema registry server. We want to add support for this back in if it provides value.


While the page does mention alternatives (Flink, Storm, Samza, Spark Streaming), it does not compare Faust to those alternatives. And it does not answer the most important question for a prospective user: why would I use this instead?

Ex ante Python seems like a poor choice for a large distributed system where network and CPU efficiency are important. Existing solutions have good performance, are easy to use, and thanks to broad adoption and community support, are very well maintained.


> why would I use this instead?

I can't speak for the developer, but a few things stood out to me:

1) It's Python. Which means there's no impedance mismatch in using numerical/data science libraries like NumPy and the like. For data engineers trying to productionize data science workloads, this is quite compelling: no need to throw away all of the Python code written by data scientists. This also lowers the barrier to entry for streaming code.

2) I had the same reservations about CPU efficiency, but it looks like they're using best-of-class libraries (RocksDB is C++, uvloop is C/Cython, etc.). I was at a PyCon talk where the speaker demoed Dask (a distributed computation system similar to Spark) running on Kubernetes, and it was very impressive. Scalability didn't seem to be an issue; Dask actually outperformed Spark in some instances.

I wonder if Kubernetes is the key to making these types of solutions competitive with traditional JVM-type distributed applications like Spark Streaming, etc.

3) Not all streaming data is real-time. In fact, streaming just means unbounded data with no inherent stipulation of near real-time SLAs. Real-time is actually a much stricter requirement.


My name is Ask and I'm co-creator on this project, along with Vineet Goel.

This answer sums up why we wanted to use Python for this project: it's the most popular language for data science and people can learn it quickly.

Performance is not really a problem either: with Python 3 and asyncio we can process tens of thousands of events per second. I have seen 50k events/s, and there are still many optimizations that can be made.
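
One easy win, since the README mentions uvloop support, is installing it as the event loop before the worker starts (a sketch):

```python
import asyncio

import uvloop

# Everything built on asyncio, including the Faust worker,
# then runs on uvloop's faster event loop implementation.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
```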

That said, Faust is not just for data science. We use it to write backend services that serve WebSockets and HTTP from the same Faust worker instances that process the stream.


> Ex ante Python seems like a poor choice for a large distributed system where network and CPU efficiency are important

Not every component in a distributed system needs to be "efficient". There is a broad class of software solving real problems that can benefit from Python having a stream processor on top of robust infrastructure.


I would consider myself fairly agnostic when it comes to languages (although Python happens to be the language I work with most, which has a lot to do with what my coworkers prefer). What I can say is that, quite often, those large distributed system solutions have high constant overheads/penalties if you aren't building the BIG solution, so everything from functional testing to the small tasks that happen more often than one would like takes ages. I wouldn't be surprised if a Python stream processing framework could provide a good developer-happiness to efficiency ratio, just by being responsive.


Because if your topics' throughput is low (say <1000 messages per second), all those tools might be overkill, as most of them require a new cluster, significant setup, and learning a new framework. This library might help you quickly build consumers with little to no extra setup/infra.

Also, Robinhood using this in production suggests it performs reasonably well and is battle-tested enough to handle some operations behind a popular app such as theirs.

Definitely excited to try this out and see how it fares compared to Kafka Streams and Spark Structured Streaming, both of which we use.


Not having to use systems such as YARN/Mesos was a big motivation for us. Faust works well with any existing deployment system you may have (we currently use supervisor with EC2 instances), and it definitely made it easy for teams to start building/deploying apps. Further, since Faust is a library, it can work with any existing Python projects/tools; we use it both independently and alongside Django.



