Show HN: Pathway – Build Mission Critical ETL and RAG in Python (NATO, F1 Used) (github.com/pathwaycom)
73 points by janchorowski 7 months ago | 19 comments
Hi HN data folks,

I am excited to share Pathway, a Python data processing framework we built for ETL and RAG pipelines.

https://github.com/pathwaycom/pathway

We started Pathway to solve event processing for IoT and geospatial indexing. Think freight train operations in unmapped depots bringing key merchandise from China to Europe. This was not something we could use Flink or Elastic for.

Then we added more connectors for streaming ETL (Kafka, Postgres CDC…), data indexing (yay vectors!), and LLM wrappers for RAG. Today Pathway provides a data indexing layer for live data updates, stateless and stateful data transformations over streams, and retrieval of structured and unstructured data.

Pathway ships with a Python API and a Rust runtime based on Differential Dataflow for incremental computation. The whole pipeline is kept in memory and can easily be deployed with Docker and Kubernetes (pipelines-as-code).
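To give a feel for the API, here is a minimal sketch of a streaming aggregation (the schema, paths, and column names are invented for illustration):

    import pathway as pw

    class InputSchema(pw.Schema):
        sensor_id: str
        value: float

    # Read a directory as a stream: new CSV files are picked up automatically.
    readings = pw.io.csv.read("./input/", schema=InputSchema, mode="streaming")

    # Incrementally maintained aggregate: updates as new rows arrive.
    stats = readings.groupby(readings.sensor_id).reduce(
        readings.sensor_id,
        avg_value=pw.reducers.avg(readings.value),
    )

    pw.io.csv.write(stats, "./output/stats.csv")
    pw.run()  # builds the dataflow and runs until interrupted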

We built Pathway to support enterprises like F1 teams and NATO in building mission-critical data pipelines. We do this by putting security and performance first. For example, you can build and deploy self-hosted RAG pipelines with local LLMs and Pathway's in-memory vector index, so no data ever leaves your infrastructure. Pathway connectors and transformations work with live data by default, so you can avoid expensive reprocessing and rely on fresh data.
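As a sketch of the self-hosted setup, a live document index with a local embedding model looks roughly like this (simplified; see the LLM xpack docs for the exact API):

    import pathway as pw
    from pathway.xpacks.llm.embedders import SentenceTransformerEmbedder
    from pathway.xpacks.llm.splitters import TokenCountSplitter
    from pathway.xpacks.llm.vector_store import VectorStoreServer

    # Watch a local folder; the index updates as files are added or changed.
    docs = pw.io.fs.read("./documents/", format="binary", with_metadata=True)

    server = VectorStoreServer(
        docs,
        embedder=SentenceTransformerEmbedder(model="all-MiniLM-L6-v2"),  # local model
        splitter=TokenCountSplitter(),
    )
    server.run_server(host="127.0.0.1", port=8765)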

You can install Pathway with pip or run it with Docker, and get started with templates and notebooks: https://pathway.com/developers/showcases

We also host demo RAG pipelines implemented 100% in Pathway; feel free to interact with their API endpoints: https://pathway.com/solutions/rag-pipelines#try-it-out

We'd love to hear what you think of Pathway!




I am curious about your hosting; the Community plan notes "8 GB RAM - 4 cores". Is there some element of Pathway that is always hosted and would utilize this capacity, even for local deployments? Or is this just "Hey, if you want to play around on Pathway hardware, this is how much you can use"? This looks amazing, and I am wondering where "the rub" is :)


The main factor impacting the RAM requirement of the instance is the size of the data that you feed into it, especially if you need an in-memory index. (If you are curious about peak memory use etc., you can profile Pathway memory use in Grafana: https://github.com/pathwaycom/pathway/tree/main/examples/pro....)

One point to clarify is that "Pathway Community" is self-hosted, and the "8 GB RAM - 4 cores" value is just a cap on how much of your own machine (or cloud machine) the framework will effectively use. Currently, if you would like to get a "free" cloud machine to go with your project, we suggest going for "Pathway Scale" and reaching out through the #Developer Assist link, mentioning that you are interested in cloud credits. You can also go with 3rd-party hosting providers like http://render.com/, which has a (somewhat modest) free tier for Docker instances, or reasonably priced ones like fly.io: https://fly.io/docs/about/pricing/.


That’s a very interesting model; I don’t think I’ve seen that before. Is the Rust engine that sits underneath Python shipped as a compiled executable, with a license check/capability limitation? EDIT: there is Rust source code in the Pathway repository.

Another edit: there is license-checking code in the Rust source; it seems fair to ask users of your copyrighted code to abide by your limits, even if they are self-enforced, if that’s the implicit agreement in the sharing of the thing. Objectively.


It's great to see you've figured out the details! Indeed, the repository comes with a mix of Python and Rust which need to be built together. We trust our users not to rebuild the package with altered parameters for production use (given that the build from source is slightly non-trivial and takes an hour or two, one cannot really get this wrong by accident...). Then, for learning and non-production use, the BSL under which Pathway is distributed allows you to do almost anything with the code.


I've built DE and AI solutions based on Pathway for multiple clients. It's robust and fast.


Thanks! Can you share some more details on the use cases and features used?


Congrats on the launch! If I understood correctly, you also build vector indexes on the fly on live data? Curious: what use cases are you seeing for RAG on streaming data?


It's mostly still data in the unstructured realm. One case is "messaging" data (live indexing of communications, social, news updates, etc.). Another is data which was not originally text but can easily be transcribed into text with an "adapter" - this includes live audio/video/screenshot streams. For now, Pathway works with discrete event streams, so audio transcription needs to be done upstream, e.g. by pairing a live captioning service with Pathway. On the use case side, it tends to be less about interactive question/answer and more about alerting handled with pre-registered queries ("alert me when X happens") - see the sketch below.
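Schematically, such a pre-registered alert can be wired up with a subscriber callback (the schema and threshold here are invented for illustration):

    import pathway as pw

    class News(pw.Schema):
        headline: str
        score: float

    feed = pw.io.jsonlines.read("./feed/", schema=News, mode="streaming")
    alerts = feed.filter(feed.score > 0.9)

    # Called incrementally on every change to the alerts table.
    def on_change(key, row, time, is_addition):
        if is_addition:
            print("ALERT:", row["headline"])

    pw.io.subscribe(alerts, on_change)
    pw.run()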


Good old "Enterprise" NATO! Always good for a surprise


Some folks say it's not Fortune 100 but Fortune 1 ;-)


Definitely the NAFO gang and NATOwaves listeners!


If the whole pipeline and the vector index are kept in memory... does Pathway still persist state somewhere?


(Adrian from the Pathway team here.) Indeed, everything is RAM-based, and persistence/caching relies on file backends. The precise backend to use is a code configuration parameter; S3 and the local filesystem are the currently supported options. For documentation, see the user guide under Deployment -> Persistence.
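Schematically, enabling persistence looks like this (a sketch; check the docs for the exact parameter names):

    import pathway as pw

    # Persist operator state and input snapshots to local disk;
    # an S3 backend can be configured instead.
    backend = pw.persistence.Backend.filesystem("./pw-storage/")
    pw.run(persistence_config=pw.persistence.Config(backend))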


Nice, thanks! I was reading https://pathway.com/developers/user-guide/deployment/persist.... If I understand correctly, you persist both source data and internal state, including the intermediate state of the computational graph, and you only rely on the backend to recover from failures and upgrades. So if I want to clone a Pathway instance, I don't need to reprocess all the source data; I can recover the intermediate state from the snapshot.

Is it the same logic for the VectorStoreServer? https://pathway.com/developers/user-guide/llm-xpack/vectorst...


For indexing operators, there is some flexibility regarding how much internal operator state is persisted. Say, for a stream-stream join structure, it's actually often faster to rebuild its state from its "boundary conditions" than to persist it fully. For vector indexes, it is necessary to persist rather more of the internal state due to determinism issues (the next time the index is rebuilt, it could come back different, and could give different approximate results, which is bad). Currently, the HNSW implementation which is the basis of VectorStoreServer is not yet fully integrated into the main Differential Dataflow organization, and has its own way of persisting/caching data "on the side". All in all, this part of the codebase is relatively young, and there is a fair amount of room for improvement.


Great job on Pathway. It's impressive to see a Python tool for ETL and RAG tasks with such strong features. The Python API and Rust runtime for quick updates look interesting. Focusing on security and performance, especially with self-hosted RAG pipelines, is fantastic. Excited to see how this OSS repo grows.


Thanks for the kind words!


Looks nice! The examples on your site look easy to reproduce!

BTW. Super nice and clear website!


Looks awesome!



