
Faust: Stream Processing for Python - ericliuche
https://github.com/robinhood/faust
======
simonw
> Faust is extremely easy to use [...] Faust only requires Kafka, the rest is
> just Python

Kafka is still a pretty chunky dependency.

Have you thought about getting this to work with the new streams stuff in
Redis? [https://brandur.org/redis-streams](https://brandur.org/redis-streams)

Redis is SO MUCH less hassle to get running (in both dev and production) than
Kafka.

~~~
asksol
My name is Ask and I'm the co-creator of this project, along with Vineet.

Yes, we're definitely interested in supporting Redis Streams!

Faust is designed to support different brokers, but Kafka is the only
implementation as of now.

~~~
simonw
Well I love Celery, so I'm even more excited to learn more about Faust now :)

~~~
welder
If you liked Celery 3.x but don't like all the bugs in Celery 4.x, check out
these alternative Python task queues:

[https://github.com/closeio/tasktiger](https://github.com/closeio/tasktiger)
(supports subqueues to prevent users bottlenecking each other)

[https://github.com/Bogdanp/dramatiq](https://github.com/Bogdanp/dramatiq)
(high message throughput)

~~~
asksol
In Faust we've had the ability to take a very different approach towards
quality control. We have a cloud integration testing setup that actually runs
a series of Faust apps in production. We do chaos testing: randomly
terminating workers, randomly blocking network access to Kafka, and much
more, all the while monitoring the health of the apps and the consistency of
the results they produce.

If you have a Faust app that depends on a particular feature we strongly
suggest you submit it as an integration test for us to run.

Hopefully some day Celery will be able to take the same approach, but running
cloud servers costs money that the project does not have.

------
ryanworl
I don’t see any discussion of idempotent processing or the Kafka transactions
feature in the documentation, and I also see a mention of acking unprocessed
messages, in the form of acking even when an exception is thrown in the
application.

These seem like pretty important details to acknowledge up front.

Are there plans to support idempotent processing, or is this already
implemented and I just missed it?

The developer facing API looks very nice!

~~~
dleclere
It's hard to say, but Kafka provides a number of important primitives for
idempotent processing out of the box for consumers and producers. Given that
stream-style apps are just a series of consumers and producers linked
together, there is good reason to believe this would offer idempotent
processing.

~~~
vineetgoel1992
Hi, thanks for the comments. My name is Vineet and I am a co-creator of this
project along with Ask.

We do plan on adding support for Kafka exactly-once semantics to Faust. As
@dleclere pointed out, streaming apps are simply a topology of consumers and
producers. We will be adding this as soon as the underlying Kafka clients add
support for exactly-once semantics.

------
tern
"Faust" is a name-collision with another piece of software that does (audio)
stream processing: [http://faust.grame.fr/](http://faust.grame.fr/).

~~~
xapata
It's also a name collision with a whole bunch of other things. Pretty soon
we'll need some namespaces to help distinguish these things, or maybe just
context.

~~~
stock_toaster
> Pretty soon we'll need some namespaces to help distinguish these things, or
> maybe just context

Quite the /faustian/ bargain you propose...

~~~
kranner
What is the tradeoff/downside to namespaces?

Other than their being cumbersome, but that is upfront, so I don't see the
Faustian-ness there.

~~~
p1necone
The better the comedic payoff, the flimsier the alignment with reality needs
to be.

As a corollary though - the stronger the alignment with reality, the better
the comedic payoff - all other things being equal.

------
jsmeaton
How does Faust work with Django? There’s a mention of needing gevent or
eventlet to bridge, but nothing about how you’d integrate the two.

I’m assuming the entrypoint would need a django.setup, and then you’d import
the modules that had the Faust tasks? A how-to in the docs would be very
useful.

I see a mention in one of the other comments of acking due to an exception.
This is one of the major issues with Celery IMO (at least before version 4,
when the new setting was added). I’d hope that exception acking would at
least be configurable.

~~~
asksol
There's an examples/django project in the distribution. I think they removed
the gevent bridge from PyPI for some reason, but you can still use the
eventlet one. Gevent is production quality and the concept of bridging them is
sound, so I hope someone will work on it.

------
nemothekid
The docs link doesn't seem to be live yet:

[http://docs.fauststream.com/en/latest/introduction.html](http://docs.fauststream.com/en/latest/introduction.html)

One question I had is how Robinhood might be doing message passing and
persistence. For example, I might have a stream that creates more messages
that need to be processed, and then eventually I would want to persist the
"Table" to an actual datastore.

~~~
asksol
My name is Ask and I am one of the co-creators along with Vineet.

Thanks for pointing this out. Fixed the links. You can also find the docs
here:
[http://faust.readthedocs.io/en/latest/](http://faust.readthedocs.io/en/latest/)

Faust uses Kafka for message passing. The new messages you create can be
pushed to a new topic and you could have another agent consuming from this new
topic. Check out the word count example here:
[https://github.com/robinhood/faust/blob/9fc9af9f213b75159a54...](https://github.com/robinhood/faust/blob/9fc9af9f213b75159a5413c831fe75eebebbbc51/examples/word_count.py)

Also note that the Table is persisted in a log-compacted Kafka topic. This
means we are able to recover the state of the table in the case of a failure.
However, you can always write to any other datastore while processing a stream
within an agent. We do have some services that process streams and store
state in Redis and Postgres.

~~~
nemothekid
What I'm thinking of is flushing the table to a secondary storage so that
other services can query that data.

I think Storm/Flink have the concept of a "tick tuple", a message that arrives
every n seconds to tell the worker to flush the table to some other store.
I've been looking over the project, and I'm not sure how I would do this in
Faust yet, as far as I understand the "Table" is partitioned, so you'd have to
send a tick to every worker.

Very interesting project!

~~~
asksol
Faust can serve the data over HTTP/websockets and other transports, so you can
query it directly!

~~~
vineetgoel1992
In addition to that you can have an arbitrary method run at an interval using
the app.timer decorator. You can use this to flush every n seconds. You could
also use stream.take to read batches of messages (or wait t seconds, whichever
happens first) and write to an external source.

------
ludwigvan
How does this compare with Celery, given that the author of both libraries is
the same?

Is it an apple and oranges comparison? When would one use this instead of
Celery?

Found the following for those curious:
[http://faust.readthedocs.io/en/latest/playbooks/vscelery.htm...](http://faust.readthedocs.io/en/latest/playbooks/vscelery.htmlt)

~~~
vvatsa
Typo in link,
[http://faust.readthedocs.io/en/latest/playbooks/vscelery.htm...](http://faust.readthedocs.io/en/latest/playbooks/vscelery.html)

~~~
randlet
I guess the libraries ultimately serve different purposes but the Celery
example is 100 times more approachable than the Faust example.

~~~
asksol
Yes, they do serve different purposes but they also share similarities. You
could easily write a task queue on top of Faust.

It's important to remember that users had difficulty understanding the
concepts behind Celery as well; perhaps it's more approachable now that you're
used to it.

Using an asynchronous iterator for processing events enables us to maintain
state. It's no longer just a callback that handles a single event; you can
do things like "read 10 events at a time", or "wait for events on two
separate topics and join them".
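The "read 10 events at a time" pattern isn't specific to Faust; with plain asyncio the idea looks roughly like this (a hypothetical helper, not Faust's implementation):

```python
import asyncio


async def take(source, max_, within):
    """Yield lists of up to `max_` items from async iterator `source`,
    or whatever arrived within `within` seconds, whichever comes first."""
    it = source.__aiter__()
    done = False
    while not done:
        batch = []
        loop = asyncio.get_running_loop()
        deadline = loop.time() + within
        while len(batch) < max_:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break  # time budget for this batch is spent
            try:
                batch.append(await asyncio.wait_for(it.__anext__(), timeout))
            except asyncio.TimeoutError:
                break  # flush whatever we have so far
            except StopAsyncIteration:
                done = True
                break
        if batch:
            yield batch


async def numbers():
    # A toy event source standing in for a Kafka stream.
    for i in range(25):
        yield i


async def main():
    batches = []
    async for batch in take(numbers(), max_=10, within=1.0):
        batches.append(batch)
    return batches
```

With a fast source like this, `asyncio.run(main())` yields two full batches of 10 followed by a final batch of 5.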

------
wenc
I was kinda wondering about the performance of a streaming lib written in
Python, but from the README it looks like they support high-performance
dependencies like uvloop, RocksDB and C extensions. And it's used in
production at Robinhood. This could have been written in Go, but it was
written in Python. Very interesting.

~~~
nemothekid
In my experience, if performance is an absolute need, there are plenty of
other frameworks out there. Apache Flink and Apache Apex come to mind, even
though they are JVM.

What's interesting about the Python angle is that I've noticed there are data
science teams who want to manage streams in production; however, there tends
to be a heavy dependency on Python and its ecosystem (numpy, scipy,
tensorflow).

------
crgwbr
Looks similar to the project my team uses for Kafka / Kinesis stream
processing in Python / Django projects:
[https://gitlab.com/thelabnyc/django-logpipe](https://gitlab.com/thelabnyc/django-logpipe)

------
dinedal
Docs seem to be down?
[http://docs.fauststream.com/en/latest/playbooks/quickstart.h...](http://docs.fauststream.com/en/latest/playbooks/quickstart.html)

~~~
ashesha20
Seems to be working for me:
[http://faust.readthedocs.io/en/latest/](http://faust.readthedocs.io/en/latest/)

~~~
timothylaurent
The link is wrong... make a PR!

------
agigao
Well, a bit off-topic but all this naming chaos has gone way too far.
Kafka/Faust and so on and so on. I found it really disturbing the way we treat
and abuse such meaningful heritage.

~~~
ds0
I wouldn't worry about it too much. The vast difference in contexts (literary
circles, high school classrooms VS. HN comment pages, Slack chatrooms,
developer podcasts) ensures there isn't any confusion. If anything, the choice
of name speaks to the importance of its source.

------
peterwwillis
> Tables are stored locally on each machine using a superfast embedded
> database written in C++, called RocksDB.

I haven't used RocksDB. Does being a LevelDB fork give it similar corruption
issues?

~~~
vineetgoel1992
We have been using RocksDB for some time and it works fairly well. While we
haven't seen any corruption yet, as you correctly pointed out, it is
definitely a possibility.

We use RocksDB only as a cache and use log-compacted Kafka topics as the
source of truth. In the case of RocksDB corruption, we can simply delete the
local RocksDB cache and the Faust app should recover (and rebuild the RocksDB
state) from the log-compacted Kafka topic.

------
kyloon
Faust looks really cool with its native Python implementation. How does this
compare with Apache Beam or Google's Dataflow, given that they have recently
rolled out their Python SDK?

~~~
asksol
Faust is a library that you can import into your Python program, and all it
requires is Kafka. Most other stream processing systems require additional
infrastructure. Kafka Streams has similar goals, but Faust additionally
enables you to use Python libraries and perform async I/O operations while
processing the stream.

------
timothylaurent
Do you have any thoughts for creating Models from Avro schemas?

~~~
asksol
We had support for it initially, but we ended up not using the schema registry
server. We want to add support for this back in if it provides value.

------
rilut
Is there something like this for node.js/typescript?

~~~
speeq
[https://nsq.io](https://nsq.io)

------
wskinner
While the page does mention alternatives (Flink, Storm, Samza, Spark
Streaming) it does not compare Faust to those alternatives. And it does not
answer the most important question for a prospective user: why would I use
this instead?

Ex ante, Python seems like a poor choice for a large distributed system where
network and CPU efficiency are important. Existing solutions have good
performance, are easy to use, and thanks to broad adoption and community
support, are very well maintained.

~~~
wenc
> why would I use this instead?

I can't speak for the developer, but a few things stood out to me:

1) It's Python. Which means there's no impedance mismatch in using
numerical/data science libraries like Numpy and the like. For data engineers
trying to productionize data science workloads, this is quite compelling -- no
need to throw away all of the Python code written by data scientists. This
also lowers the barrier to entry for streaming code.

2) I had the same reservations about CPU efficiency, but it looks like they're
using best-of-class libraries (RocksDB is C++, uvloop is C/Cython, etc.). I
was at a PyCon talk where the speaker demo'ed Dask (a distributed computation
system similar to Spark) running on Kubernetes and it was very impressive.
Scalability didn't seem to be an issue. Dask actually outperformed Spark in
some instances.

I wonder if Kubernetes is the key to making these types of solutions
competitive with traditional JVM type distributed applications like Spark
Streaming, etc.

3) Not all streaming data is real-time. In fact, streaming just means
unbounded data with no inherent stipulation of near real-time SLAs. Real-time
is actually a much stricter requirement.

~~~
asksol
My name is Ask and I'm a co-creator of this project, along with Vineet Goel.

This answer sums up why we wanted to use Python for this project: it's the
most popular language for data science and people can learn it quickly.

Performance is not really a problem either: with Python 3 and asyncio we can
process tens of thousands of events per second. I have seen 50k events/s, and
there are still many optimizations that can be made.

That said, Faust is not just for data science. We use it to write backend
services that serve WebSockets and HTTP from the same Faust worker instances
that process the stream.

