
Scalable Python Code with Pandas UDFs - bweber
https://towardsdatascience.com/scalable-python-code-with-pandas-udfs-a-data-science-application-dd515a628896
======
iblaine
Could this same problem be solved by using Apache Arrow to convert to/from
pandas, cutting down the complexity in the process?

~~~
lacksconfidence
This does use Apache Arrow; that's how the data is transferred from Spark to
pandas and back.

~~~
iblaine
Great, glad this was confirmed, as I am about to solve a similar problem and
expect that Apache Arrow will be needed. So pyarrow = Apache Arrow.
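
(Editor's note, not from the thread: pyarrow is the Python implementation of Apache Arrow, and Spark gates the Arrow-based conversion behind a config flag. A config fragment for Spark 3.x, assuming an existing SparkSession bound to `spark`; Spark 2.x used `spark.sql.execution.arrow.enabled` instead:)

```python
# Config fragment: enable Arrow-backed pandas conversion in PySpark 3.x.
# Assumes a running SparkSession named `spark`.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
```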

------
mrbonner
This approach makes sense for prediction. Obviously, one could split the data
to run distributed prediction. But how does this work for training the linear
model mentioned here with scikit-learn in a distributed fashion?

~~~
bweber
There are approaches for using Spark to distribute hyperparameter tuning and
cross validation: [https://databricks.com/blog/2016/02/08/auto-scaling-
scikit-l...](https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-
with-apache-spark.html)
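
(Editor's note: the linked spark-sklearn approach parallelizes cross-validated grid search across a cluster. A single-machine sketch of the same pattern using scikit-learn's own GridSearchCV, with a synthetic dataset invented for illustration; spark-sklearn fans the identical candidate grid out over Spark workers instead:)

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary-classification data, purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Cross-validated search over the regularization strength C;
# each (candidate, fold) pair is an independent fit, which is what
# makes this workload easy to distribute.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
)
grid.fit(X, y)
```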

However, for the example in this post, I would recommend using the logistic
regression provided by MLlib to scale up.

------
bweber
Non paywall link: [https://medium.com/p/scalable-python-code-with-pandas-
udfs-a...](https://medium.com/p/scalable-python-code-with-pandas-udfs-a-data-
science-application-dd515a628896?source=email-a80e1f69e782--
writer.postDistributed&sk=a6c9f3be9fe3904b79d3e443a13e6ab9)

