Folks interested in GPU-streaming data using 100% native Python (not needing a Spark setup is a big win) can look up the Anaconda package called custreamz, which is part of the NVIDIA RAPIDS open-source GPU data science libraries.
On the streaming feature-parity front, we've made the Kafka integration robust and added checkpointing to streamz, which is a must-have for production streaming pipelines.
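To give a flavor, here's a minimal sketch of a checkpointed Kafka pipeline in streamz (the broker address, topic name, and consumer group below are all made-up placeholders; the checkpointing works through the consumer group, whose committed offsets let a restarted job resume where it left off):

```python
# Hedged sketch: poll a Kafka topic in batches with streamz.
# "localhost:9092", "telemetry", and "demo-group" are assumptions.
from streamz import Stream

consumer_conf = {
    "bootstrap.servers": "localhost:9092",  # assumed local broker
    "group.id": "demo-group",               # consumer group; its committed
                                            # offsets act as the checkpoint
}

# Emit a batch of messages downstream once per second.
source = Stream.from_kafka_batched("telemetry", consumer_conf,
                                   poll_interval="1s")

source.map(len).sink(print)  # e.g. count messages per batch

source.start()  # start polling; offsets get committed as batches complete
```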
I’d be happy to answer any questions you guys may have, and would love to have more people use streamz and contribute if possible.
As for scaling to big-data volumes, streamz works well with Dask, so GPU-accelerated streaming in distributed mode is on! :)
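The pattern there is streamz's scatter/gather: each batch is shipped to a Dask worker, transformed in parallel, and gathered back into the local stream. A rough sketch (the cluster setup and the transform are placeholders):

```python
# Hedged sketch of the streamz + Dask scatter/gather pattern.
from dask.distributed import Client
from streamz import Stream

client = Client()  # start/connect to a local Dask cluster

def transform(batch):
    # placeholder for per-batch work (a cuDF operation in cuStreamz, say)
    return batch

source = Stream()
(source.scatter()        # send each emitted batch to a Dask worker
       .map(transform)   # runs on the workers
       .buffer(8)        # allow up to 8 batches in flight
       .gather()         # pull results back into the local process
       .sink(print))

for i in range(10):
    source.emit(i)
```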
For instance, I have an electronics board grabbing 10,000 analog voltage samples per second on multiple channels.
I want to:
- ingest chunks of that data at 100 ms intervals (so 1,000 samples per interval)
- update some plots in a PyQt GUI at that same 100 ms interval
- and once a second, compute an FFT over 10-20 seconds of data for one or more channels.
I have basic code working using numpy and numpy.roll().
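Roughly something like this (single channel; the constants are just the numbers from above):

```python
# Rolling-buffer sketch: keep the last 20 s of samples, append a new
# 100 ms chunk on each ingest, FFT the whole window once a second.
import numpy as np

FS = 10_000              # samples per second
CHUNK = 1_000            # samples per 100 ms ingest
buf = np.zeros(FS * 20)  # 20 s of history

def ingest(chunk):
    """Shift the window left by CHUNK and append the new samples."""
    global buf
    buf = np.roll(buf, -CHUNK)  # note: copies the whole buffer, O(n)
    buf[-CHUNK:] = chunk

def spectrum():
    """FFT of the full 20 s window (called once per second)."""
    return np.fft.rfft(buf)
```

One thing I've noticed is that np.roll copies the entire buffer on every call, so at higher rates an index-based circular buffer (a write pointer into a preallocated array) would avoid that copy entirely.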
I am debating whether I should be using C#/.NET or C++ instead of Python for this... there are strong speed advantages.
Numpy is fast enough to handle what I need to do right now... but if I want to up my data rate by 10x (10 kHz to 100 kHz), I think I am going to run into hard limitations with Python, numpy, and PyQt GUI updates.
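For what it's worth, the FFT part seems easy to sanity-check in isolation (numbers will obviously vary by machine):

```python
# How long does one FFT of 20 s of data at 100 kHz actually take?
import time
import numpy as np

x = np.random.randn(100_000 * 20)  # 2M samples = 20 s at 100 kHz

t0 = time.perf_counter()
np.fft.rfft(x)
print(f"rfft of {x.size:,} samples took {time.perf_counter() - t0:.3f} s")
```

My guess is the once-a-second FFT isn't the wall even at 100 kHz; the per-100 ms GUI redraws worry me more.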
Thoughts and suggestions are much appreciated :-)
Would love a link to any learning resources you know of (especially if they have architecture diagrams)! I am on my phone right now, but I'll look more this weekend and see what I can find; any input is appreciated!
As far as I know none of the above technologies use texture memory.
The RAPIDS docs site has an overview presentation that is a good source of architecture diagrams and would likely be a good start to further reading: https://docs.rapids.ai/overview
I'm just talking out my butt right now, but I think fundamentally a stream is just a chunk of data lifted from persistent storage into memory. I imagine a cursor process traversing the bytes in a file, lifting some of those bytes into memory, and sending that memory over the network.
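In code terms, I'm picturing something like a chunked reader (purely illustrative; the path and socket are hypothetical):

```python
# The mental model: a cursor walks a file, lifts fixed-size chunks of
# bytes into memory, and each chunk becomes an element of the stream.
def byte_stream(path, chunk_size=64 * 1024):
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # cursor advances chunk_size bytes
            if not chunk:               # EOF: the stream is exhausted
                break
            yield chunk                 # hand the in-memory chunk downstream

# e.g., push each chunk over a (hypothetical) socket:
# for chunk in byte_stream("data.bin"):
#     sock.sendall(chunk)
```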
I suspect you mean __Designing Data-Intensive Applications__ by Martin Kleppmann, but I am not entirely sure.
edit: just found out Flink now has a Python API! So include it in the comparison as well. Not sure whether the Apache Flink Python API also has serialization overhead.
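For anyone curious what that Python API looks like, a tiny sketch (the DataStream API only landed in relatively recent PyFlink releases, so exact details vary by version; treat this as illustrative):

```python
# Hedged PyFlink sketch; not version-exact.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Build a trivial pipeline: double each element and print it.
ds = env.from_collection([1, 2, 3, 4])
ds.map(lambda x: x * 2).print()

env.execute("pyflink-demo")  # nothing runs until execute() is called
```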