Streamz: Python pipelines to manage continuous streams of data (streamz.readthedocs.io)
211 points by aeontech on March 27, 2020 | 24 comments

We at NVIDIA are heavily invested in developing streamz. We’ve made streamz compatible with RAPIDS cuDF, which means streaming jobs in Python can be GPU-accelerated now.

Folks interested in GPU-streaming data using 100% native Python (no Spark setup needed, which is a big win) can look up the Anaconda package called custreamz, which is part of the NVIDIA RAPIDS open-source GPU data science libraries.

On the streaming feature-parity front, we’ve made the Kafka integration robust and added checkpointing to streamz, which is a must-have feature for production streaming pipelines.

I’d be happy to answer any questions you guys may have, and would love to have more people use streamz and contribute if possible.

As for scaling for big data streaming, streamz works well with Dask, so GPU-accelerated streaming in distributed mode is on! :)
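For anyone who hasn’t seen the programming model: streamz pipelines are push-based graphs of operators that you `emit` events into. Here is a toy, pure-Python sketch of that map/sink pattern (not the real library, just an illustration of the model; the actual `streamz.Stream` API looks very similar):

```python
class Stream:
    """Tiny push-based stream, mimicking the streamz programming model."""
    def __init__(self):
        self.downstreams = []

    def map(self, func):
        # Build a child node that applies func to every event pushed here.
        child = Stream()
        self.downstreams.append(lambda x: child.emit(func(x)))
        return child

    def sink(self, func):
        # Terminal node: call func for its side effect.
        self.downstreams.append(func)
        return self

    def emit(self, x):
        for push in self.downstreams:
            push(x)

# Usage: double each event and collect the results.
results = []
source = Stream()
source.map(lambda x: 2 * x).sink(results.append)
for i in range(3):
    source.emit(i)
# results is now [0, 2, 4]
```

With the real library you get the same shape of code, plus Kafka sources, windowing, and the option to run the graph on a Dask cluster.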

Is streamz/CuDF/CUDA a good option for soft real-time processing of high bandwidth sensor data?

For instance, I have an electronics board grabbing 10,000 analog voltage samples a second on multiple channels. I want to:

- ingest chunks of that data at 100 ms intervals (so 1,000 samples per interval)
- update some plots in a PyQt GUI at the 100 ms interval
- once a second, compute an FFT over 10-20 seconds of data for one or more channels

I have basic code working using numpy and numpy.roll().
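For context, here is a minimal sketch of the alternative to the np.roll() approach: np.roll() copies the whole buffer on every chunk, while a preallocated circular buffer with a write index avoids that copy. The sample rate and window length below are illustrative, not a real configuration:

```python
import numpy as np

FS = 10_000          # samples per second (illustrative)
WINDOW_S = 10        # seconds of history to keep
N = FS * WINDOW_S    # buffer length

buf = np.zeros(N)
write = 0            # next write position

def ingest(chunk):
    """Write a chunk into the circular buffer without copying the whole array."""
    global write
    n = len(chunk)
    idx = (write + np.arange(n)) % N   # wrap-around indices
    buf[idx] = chunk
    write = (write + n) % N

def latest(n):
    """Return the most recent n samples in time order."""
    idx = (write - n + np.arange(n)) % N
    return buf[idx]

# Simulate 100 ms chunks (1,000 samples each), then FFT the last 0.5 s.
for t in range(5):
    ingest(np.full(FS // 10, float(t)))
spectrum = np.fft.rfft(latest(FS // 2))
```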

I am debating whether or not I should be using C#/.NET or C++ instead of Python for this... there are strong speed advantages.

Numpy is fast enough to handle what I need to do right now... but if I want to up my data rate by 10x (10 kHz to 100 kHz), I think I am going to run into hard limitations with Python, numpy, and PyQt GUI updates.

Thoughts and suggestions are much appreciated :-)

How does it integrate into other GPU-accelerated frameworks such as TensorFlow or PyTorch? E.g. I could use Streamz to perform some preprocessing (on GPU) and make that available as a tf.data dataset. Can I just pass on the GPU pointer, or would there be a GPU->CPU->GPU transfer?

PyTorch supports both `__cuda_array_interface__` and DLPack, both of which allow sharing the GPU pointer directly without going through a host numpy array. TensorFlow is actively working on adding DLPack support here: https://github.com/tensorflow/tensorflow/issues/24453
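As a quick illustration of the zero-copy handoff on the PyTorch side (shown here with a CPU tensor so it runs anywhere; the identical calls work for CUDA tensors, where no device-to-host copy happens):

```python
import torch
from torch.utils.dlpack import to_dlpack, from_dlpack

t = torch.arange(4, dtype=torch.float32)
capsule = to_dlpack(t)        # export as a DLPack capsule
t2 = from_dlpack(capsule)     # reimport; shares the same underlying memory
t2[0] = 99.0                  # a write through one view...
# ...is visible through the other, proving no copy was made
```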

Good summary @chinmaychandak. One thing to add: streamz is a native Python streaming library, and RAPIDS cuDF (https://rapids.ai/) is a GPU-accelerated data science toolkit. cuStreamz acts as a bridge between streamz and RAPIDS cuDF, with native Kafka integration, to GPU-accelerate end-to-end streaming use cases.

Compared to a “normal” CUDA workflow, is there still a kernel compilation step? And does this streaming make use of texture memory?

Would love a link to any learning resources you know of (especially if they have architecture diagrams)! I’m on my phone right now, but I’ll dig in more this weekend and see what I can find; any input is appreciated!

cuStreamz defers its computation down to cuDF, which uses a combination of ahead-of-time-compiled kernels via libcudf and just-in-time-compiled kernels via Numba, CuPy, and Jitify for its execution.

As far as I know none of the above technologies use texture memory.

The RAPIDS docs site has an overview presentation that is a good source of architecture diagrams and would likely be a good start to further reading: https://docs.rapids.ai/overview

Does it work on AMD or Intel GPUs?

No, CUDA only.

Does anyone know of any good resources to learn about data streams in general? Some weeks ago I had to implement some streams (in nodejs) to upload a file to S3 "on the fly", without storing the file locally and then uploading it, but I couldn't wrap my head around the data stream concept.

Have you reviewed data-intensive architectures? I've found that book quite useful.

I'm just talking out my butt right now, but I think fundamentally a stream is just a chunk of data lifted from persistent storage into memory. I imagine a cursor process traversing some bytes in a file, lifting some of those bytes into memory, and sending that memory over the network.

Yes, however I think there's a lot more complexity in modern streaming architectures: event sourcing, concurrency, eventual consistency, pub-sub, queueing, event handlers, microservices, stateless messages, data lakes - to name a few. I would also be interested in a resource that tackles the major concepts here.

> data-intensive architectures

I suspect you mean __Designing Data-Intensive Applications__, by Martin Kleppmann, but I am not entirely sure.

Yeah, I just didn't want to type it out :P

Check out dataflows in Composable for a quick way to implement these data streams. https://docs.composable.ai/en/latest/03.DataFlows/01.Overvie...

You can also use the multipart upload feature of S3. Basically, you buffer until you have a sizeable chunk of the incoming stream, then upload that chunk; iterate. At the end you tell S3 to concatenate the parts.
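The buffering loop can be sketched in a few lines. `upload_part` below is a hypothetical callback standing in for the real S3 UploadPart request (e.g. via boto3), and 5 MiB is S3's minimum part size for all but the last part:

```python
MIN_PART = 5 * 1024 * 1024  # S3's minimum part size (except for the final part)

def stream_to_parts(chunks, upload_part):
    """Buffer an incoming byte stream and flush it in multipart-sized pieces.

    upload_part(data, part_number) is a hypothetical callback standing in
    for the real S3 UploadPart request.
    """
    buf = bytearray()
    part_number = 1
    for chunk in chunks:
        buf.extend(chunk)
        while len(buf) >= MIN_PART:
            upload_part(bytes(buf[:MIN_PART]), part_number)
            del buf[:MIN_PART]
            part_number += 1
    if buf:  # final, possibly undersized part
        upload_part(bytes(buf), part_number)

# Usage: feed 12 MiB in 3 MiB chunks and record which parts get "uploaded".
parts = []
stream_to_parts(
    (b"x" * (3 * 1024 * 1024) for _ in range(4)),
    lambda data, n: parts.append((n, len(data))),
)
# parts: two full 5 MiB parts plus a final 2 MiB part
```

After the last part, the real workflow issues a CompleteMultipartUpload call so S3 stitches the parts together.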

I wonder how this compares to Apache Flink (or Google's Dataflow) with regard to scalability and fault tolerance for stateful computations. Checkpointing is mentioned below, but is this really equivalent to what Flink can do?

Nice. This would help solve the issues of deploying PyTorch/TensorFlow models in a streaming environment (Apache Beam does not make it easy). I am curious how performant this is compared to Flink or Spark Streaming/Structured Streaming. When you deploy deep learning models in pyspark, you run into performance issues from serializing between the JVM and Python.

edit: just found out Flink now has a Python API! So include it in the comparison as well. Not sure if the Flink Python API also has serialization overhead.

Very cool! I'm interested in learning more about the memory model of multi-stage streaming systems. From my limited understanding of distributed data systems, we went from saving files and checkpointing completely on disk a la MapReduce to a disk/memory hybrid with Spark RDDs and lazy job execution (not sure at all?) to pure streaming like Kafka. Could somebody please enlighten me? I'd love to learn more.

I just started reading Streaming Systems: https://www.goodreads.com/book/show/34431414-streaming-syste.... Seems really great so far.

I would add that Kafka is a distributed commit log; streaming is one of its applications. Spark also supports streaming. One should distinguish between the compute model (MapReduce, RDDs, etc.) and the data model (streaming).


Is this running in production anywhere?

Define production. But probably not by a reasonable definition.

We are investing heavily in streamz at NVIDIA as a Python streaming library. One of the core requirements is Kafka checkpointing for reliable end-to-end pipelines, which we implemented recently and committed to trunk. With that big milestone, we are one step closer to moving to production.
