
We didn't develop the PySpark pipeline. It was handed to us to improve, so we did a full rewrite to make it cleaner and easier to understand. We also tried a persistence switch to see whether it was a better choice, so that if a step failed we could resume from a previous one. I also had zero hands-on experience with PySpark or DuckDB. But yes, I was amazed at how far it was falling behind DuckDB. I wasn't expecting such a difference. Also, this pipeline did indeed run in the cloud, but it was not possible to test it there, so the only choice was to run it locally.

