
Things I Wish I'd Known About Spark When I Started - enigma_daniel
https://www.enigma.com/blog/things-i-wish-id-known-about-spark
======
nevi-me
Interesting points! I hadn't used Spark in a while until I recently got back
into it.

I think data engineers 'should' add Spark to their toolset, even if just for
ETL. I found myself having to go through a few hundred CSV files and import
them into Oracle, Spark made it fast and easy.

Many of the pain points you mention are universal whether you use Scala or
Python. I wish I'd known about .par years ago.

I also find checkpointing helps a lot, especially with JDBC data sources. If
I'm reading in data that doesn't change, I write it out to parquet once, and
then switch to reading from parquet.

Arrow has made pyspark more pleasant to work with, especially if you primarily
work in notebooks.

The one thing I struggled with was latency on time-critical jobs. I'd
sometimes get a few minutes of pauses on jobs that should take under a minute
to run. I haven't checked how that's improved since Spark 2.0, though.

------
tveita
"If you are responsible for generating parquet from another format—say you are
using PyArrow and Pandas for some large-scale migration—be conscious that
simply creating a single parquet file gives up a major benefit of the format."

I don't understand this part, it's not clear which major benefit you're giving
up, or what you should do instead. Is it saying not to convert these formats
to parquet? Or that you should create multiple parquet files to get the full
benefits?

~~~
enigma_daniel
Great question - and I agree it's a bit unclear which benefit you're giving
up. The benefit is performance: splitting the data across multiple parquet
files lets Spark read them in parallel.

~~~
rsanders
I believe that if you have a Parquet file meeting certain criteria, it's
directly parallelizable as multiple Spark partitions without any shuffling;
the splits would occur at Parquet row group boundaries.

See https://stackoverflow.com/questions/27194333/how-to-split-parquet-files-into-many-partitions-in-spark/29819133#29819133
and https://parquet.apache.org/documentation/latest/, etc.

Whether it's better to have multiple Parquet files or a single parallelizable
Parquet file is dependent on your environment and application. At my company,
we've tended to have a single row group per file (and one HDFS block per
file), in part due to historical reasons.

