We didn't develop the PySpark pipeline ourselves. It was given to us to improve, and we ended up doing a full rewrite to make it cleaner and easier to understand. We also tried a persistence switch to test whether persisting was the better choice, so that if a step failed we could resume from a previous one. I also had zero hands-on experience with PySpark and DuckDB. But yes, I was amazed at how far PySpark was falling behind DuckDB. I wasn't expecting such a difference. Also, this pipeline did indeed run in the cloud, but it was not possible to test it there, so the only choice was to run it locally.
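
To illustrate the persistence switch idea, here is a minimal sketch. It is not our actual pipeline code: it uses plain Python, with JSON files standing in for whatever a Spark job would persist between steps. The names `run_pipeline`, `checkpoint_dir`, and `persist` are made up for the example.

```python
import json
from pathlib import Path


def run_pipeline(steps, data, checkpoint_dir, persist=True):
    """Run (name, func) steps in order.

    With persist=True, each step's output is saved to disk, and a step
    whose output already exists is skipped on the next run.
    Assumption: step outputs are JSON-serializable (hypothetical stand-in
    for persisted intermediate tables).
    """
    checkpoint_dir = Path(checkpoint_dir)
    for name, func in steps:
        checkpoint = checkpoint_dir / f"{name}.json"
        if persist and checkpoint.exists():
            # This step finished in an earlier run: load its saved output
            # instead of recomputing it.
            data = json.loads(checkpoint.read_text())
            continue
        data = func(data)
        if persist:
            # Save the output so a later run can resume after this step.
            checkpoint.write_text(json.dumps(data))
    return data
```

With `persist=False` every step is recomputed on each run. With it on, a rerun after a failure picks up from the last step that completed, at the cost of writing every intermediate result to disk.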