
Show HN: A Hands-On Guide on PySpark Coding and Best Practices - ericxiao251
https://github.com/ericxiao251/spark-syntax
======
quadrature
Great start. If you keep at it, I'd love to see more of the advanced stuff. I
feel like we're all hitting problems like skew, and it would be cool to have a
reference for dealing with those.

~~~
ericxiao251
Hey quadrature, thanks for the feedback! Would you be able to go into more
detail about what kind of skew you see :)?

In Chapter 7 I go into some methods of fixing skewed data when performing
joins. This solved the majority of our skew problems, but I believe we still
see skew on aggregates. I am working on how to debug/find skew in a Spark
application in Chapter 6. I wanted to release this first, as I've been
procrastinating for over 2 years on doing so lol.

We have also done more Spark parameter optimizations, but those help once the
data skew has been resolved.
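
For anyone unfamiliar with skewed joins, one commonly used fix is key salting. The snippet below is a minimal pure-Python sketch of that idea, not necessarily the method described in Chapter 7. The names (NUM_SALTS, salt_large_side, explode_small_side, hash_join) and the toy data are all hypothetical, chosen only for illustration.

```python
import random
from collections import Counter

NUM_SALTS = 4  # hypothetical value; in practice chosen to match the amount of skew


def salt_large_side(rows, num_salts=NUM_SALTS, seed=0):
    """Attach a random salt to each key on the large, skewed side."""
    rng = random.Random(seed)
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_small_side(rows, num_salts=NUM_SALTS):
    """Replicate each row on the small side once per possible salt value."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """Plain inner hash join on the composite (key, salt) pair."""
    index = {}
    for composite_key, value in right:
        index.setdefault(composite_key, []).append(value)
    return [
        (composite_key[0], left_value, right_value)
        for composite_key, left_value in left
        for right_value in index.get(composite_key, [])
    ]


# Toy data: the "hot" key dominates the large side.
large = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
small = [("hot", "H"), ("cold", "C")]

salted_large = salt_large_side(large)
salted_small = explode_small_side(small)
joined = hash_join(salted_large, salted_small)

# Rows per join key before and after salting.
before = Counter(key for key, _ in large)
after = Counter(composite_key for composite_key, _ in salted_large)
```

The join result is the same as an unsalted join, but the "hot" rows are now spread over NUM_SALTS composite keys, so no single key carries all of them. The cost is that the small side gets replicated NUM_SALTS times.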

------
antisocial
I didn't find Apache Arrow in this repo. I would like to learn more about your
experience with using arrow, performance improvements and any lessons.

~~~
ericxiao251
I haven't looked into or kept up with Arrow much, but if I see fit, I can add
more stuff about it :)!

~~~
RBerenguel
I’ve given a very introductory talk about what Arrow “gives for free” when
using the right kind of UDF. It’s more fun in person, but with the references
at the end and the presenter notes I think you could get an idea of what you
will want to mention quicker than having to look at it from scratch. It’s
here:
https://github.com/rberenguel/pyspark-arrow-pandas
I hope you find it useful!

~~~
ericxiao251
Oh awesome, thanks for the resources! I will definitely see how I can
incorporate it into my guide :).

------
paulgb
This is great and much needed! Looking forward to Chapter 6. Wish I had the
other chapters when I was getting started with Spark.

~~~
ericxiao251
Yes I agree, that was the whole premise of the repo 2 years ago!

I'm glad you like it :)!

