
Streamz: Python pipelines to manage continuous streams of data - aeontech
https://streamz.readthedocs.io/en/latest/index.html
======
chinmaychandak
We at NVIDIA are heavily invested in developing streamz. We’ve made streamz
compatible with RAPIDS cuDF, which means streaming jobs in Python can now be
GPU-accelerated.

Folks interested in streaming data on GPUs using 100% native Python (no Spark
setup needed, which is a big win) can look up the Anaconda package called
custreamz, which is part of NVIDIA RAPIDS, the open-source GPU data science
libraries.

On the streaming feature-parity front, we’ve made the Kafka integration robust
and added checkpointing to streamz, which is a must-have feature for production
streaming pipelines.

I’d be happy to answer any questions you guys may have, and would love to have
more people use streamz and contribute if possible.

As for scaling for big data streaming, streamz works well with Dask, so GPU-
accelerated streaming in distributed mode is on! :)
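For anyone new to streamz, its core abstraction is a push-based node graph: you build a pipeline of `map`/`sink` nodes and then `emit` data into the source. Below is a minimal hand-rolled sketch of that dataflow model, not the real streamz API (which is far richer; see the docs linked above):

```python
# Toy push-based stream graph, illustrating the model streamz uses.
# Each node optionally transforms an element, then pushes it downstream.

class Node:
    def __init__(self, fn=None):
        self.fn = fn
        self.downstream = []

    def map(self, fn):
        """Attach a transforming child node and return it (for chaining)."""
        child = Node(fn)
        self.downstream.append(child)
        return child

    def sink(self, fn):
        """Attach a terminal node that consumes each element."""
        child = Node(fn)
        self.downstream.append(child)
        return child

    def emit(self, x):
        """Push one element through this node and everything downstream."""
        y = self.fn(x) if self.fn else x
        for child in self.downstream:
            child.emit(y)

results = []
source = Node()
source.map(lambda x: x * 2).sink(results.append)

for i in range(3):
    source.emit(i)

print(results)  # [0, 2, 4]
```

The real library adds windowing, backpressure-aware async emission, Kafka sources, and Dask fan-out on top of this same push model.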

~~~
albertzeyer
How does it integrate into other GPU-accelerated frameworks such as TensorFlow
or PyTorch? E.g. I could use Streamz to perform some preprocessing (on GPU)
and make that available as a tf.data dataset. Can I just pass on the GPU
pointer, or would there be a GPU->CPU->GPU transfer?

~~~
betterwithbacon
PyTorch supports both `__cuda_array_interface__` and DLPack, which allow
effectively sharing the GPU pointer without having to go through a host numpy
array. TensorFlow is actively working on adding support for DLPack here:
[https://github.com/tensorflow/tensorflow/issues/24453](https://github.com/tensorflow/tensorflow/issues/24453)
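For reference, `__cuda_array_interface__` is just an attribute protocol: the producer exposes a dict describing its device memory, and the consumer wraps the raw pointer without copying. Here is a sketch with a fake pointer value; a real producer (cuDF, CuPy, Numba, PyTorch) supplies an actual device address:

```python
# Sketch of the __cuda_array_interface__ protocol (version 3).
# The pointer below is fake and exists only to show the dict's shape;
# dereferencing it would require real CUDA memory.

class FakeDeviceArray:
    def __init__(self, ptr, shape, typestr):
        self._ptr = ptr
        self._shape = shape
        self._typestr = typestr

    @property
    def __cuda_array_interface__(self):
        return {
            "shape": self._shape,        # e.g. (1024, 3)
            "typestr": self._typestr,    # e.g. "<f4" = little-endian float32
            "data": (self._ptr, False),  # (device pointer, read_only flag)
            "version": 3,
        }

arr = FakeDeviceArray(ptr=0xDEADBEEF, shape=(1024, 3), typestr="<f4")
iface = arr.__cuda_array_interface__
print(iface["shape"], iface["typestr"])  # (1024, 3) <f4
```

A consuming library reads this dict and builds its own tensor view over the same device allocation, so no GPU→CPU→GPU round trip happens.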

------
mrlucax
Does anyone know of any good resources to learn about data streams in general?
Some weeks ago I had to implement some streams (in Node.js) to upload a file to
S3 "on the fly", without storing the file locally and then uploading it, but I
couldn't wrap my head around the data stream concept.

~~~
yingw787
Have you reviewed data-intensive architectures? I've found that book quite
useful.

I'm just talking out my butt right now, but I think fundamentally a stream is
just a chunk of data lifted from persistent storage into memory. I imagine a
cursor process traversing some bytes in a file, lifting some of those bytes
into memory, and sending that memory over the network.
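That mental model can be sketched in a few lines of Python: a generator that lifts fixed-size chunks into memory one at a time. An on-the-fly S3 upload follows the same pattern, shipping each chunk as a multipart-upload part as soon as it is produced. The `chunk_size` and in-memory `BytesIO` here are just for illustration:

```python
import io

# Stream a "file" in fixed-size chunks instead of loading it all at once.
# Only one chunk ever sits in memory at a time, which is the whole point
# of streaming an upload rather than buffering the full file first.

def read_in_chunks(fileobj, chunk_size=4):
    """Yield successive chunks from a file-like object until EOF."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

source = io.BytesIO(b"hello streaming world")
chunks = list(read_in_chunks(source))
print(chunks)
print(b"".join(chunks))  # b'hello streaming world'
```

In a real uploader you would replace `list(...)` with a loop that sends each chunk to the destination as it arrives, so nothing is materialized locally.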

~~~
CameronNemo
> data-intensive architectures

I suspect you mean __Designing Data-Intensive Applications__, by Martin
Kleppmann, but I am not entirely sure.

~~~
yingw787
Yeah, I just didn't want to type it out :P

------
SpaceManNabs
Nice. This would help solve the issues of deploying PyTorch/TensorFlow models
in a streaming environment (Apache Beam does not make it easy). I am curious
how performant this is compared to Flink or Spark Streaming/Structured
Streaming. When you deploy deep learning models in PySpark, you run into
performance issues from serializing between the JVM and Python.

edit: just found out Flink now has a Python API! so include it in the
comparison. not sure if the Apache Flink API also has serialization overhead.

------
kohlerm
I wonder how this compares to Apache Flink (or Google's Dataflow) with regards
to scalability and fault tolerance for stateful computations. Checkpointing is
mentioned below, but is it really equivalent to what Flink can do?

------
yingw787
Very cool! I'm interested in learning more about the memory model of multi-
stage streaming systems. From my limited understanding of distributed data
systems, we went from saving files and checkpointing completely on disk a la
MapReduce to a disk/memory hybrid with Spark RDDs and lazy job execution (not
sure at all?) to pure streaming like Kafka. Could somebody please enlighten
me? I'd love to learn more.

~~~
sanderjd
I just started reading Streaming Systems:
[https://www.goodreads.com/book/show/34431414-streaming-syste...](https://www.goodreads.com/book/show/34431414-streaming-systems).
Seems really great so far.

------
tarun_anand
Is this running in production anywhere?

~~~
peteradio
Define production. But probably not by a reasonable definition.

~~~
sdandu
We are investing heavily in streamz at NVIDIA as a Python streaming library.
One of the core requirements is Kafka checkpointing for reliable end-to-end
pipelines, which we implemented recently and committed to trunk. With that big
milestone, we are one step closer to moving to production.

