
Spark vs. Snowflake: The Cloud Data Engineering (ETL) Debate - ibains
https://www.prophecy.io/blogs/spark-vs-snowflake-the-cloud-data-engineering-etl-debate
======
dalailambda
A quote from the article I would object to is "for large datasets and complex
transformations this architecture is far from ideal. This is far from the
world of open-source code on Git & CI/CD that data engineering offers - again
locking you into proprietary formats, and archaic development processes."

No one is forcing you to use those tools on top of something like Snowflake
(which is just a SQL interface). These days we have great open-source tools
(such as [https://www.getdbt.com/](https://www.getdbt.com/)) that let you
write plain SQL which you can then deploy to multiple environments, with
automated testing, deployment, and fun scripting on top. At the same time,
dealing with large datasets in the Spark world is full of lower-level details,
whereas in a SQL database it's the exact same query you would run on a smaller
dataset.
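A minimal sketch of the workflow described above: keep the transformation as plain SQL, and use a little scripting to test it automatically against a tiny dataset. This is a hand-rolled stand-in for what dbt automates, not dbt itself; sqlite3 plays the role of the warehouse, and the `orders` table and column names are hypothetical.

```python
# The transformation is plain SQL -- the same query runs unchanged on a
# small test dataset or a large production table. (Table/column names
# are hypothetical; sqlite3 stands in for the warehouse here.)
import sqlite3

DAILY_REVENUE_SQL = """
SELECT order_date, SUM(amount) AS revenue
FROM orders
GROUP BY order_date
ORDER BY order_date
"""

# Spin up an in-memory database seeded with a tiny test fixture.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 7.5)],
)

# Run the SQL and check the result -- this is the kind of automated
# test a CI pipeline (or dbt's test framework) would run on deploy.
rows = conn.execute(DAILY_REVENUE_SQL).fetchall()
print(rows)  # -> [('2024-01-01', 15.0), ('2024-01-02', 7.5)]
```

The point is that the SQL itself stays declarative and environment-agnostic; only the connection string changes between test and production.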

The reality is that the ETL model is fading in favour of ELT (load the data,
then transform it in the warehouse), because maintaining complex data
pipelines and Spark clusters makes little sense when you can spin up a cloud
data warehouse. In this world we don't just need less developer time; the
developers also don't have to be engineers who can write and maintain Spark
workloads and clusters. They can be analysts who do the transformations
themselves and get something valuable out to the business faster than the
equivalent Spark data pipeline could be built.

~~~
ibains
Very valid points: 1) Agree that Snowflake is far easier to use than Spark. 2)
Agree that DBT is a great tool.

The context here is ETL workflows that routinely process tens of terabytes
and carry large, complex business logic. With Spark code, you can break your
pipeline down into smaller pieces, see data flow across them, write unit
tests for each piece, and still have the entire thing execute as a single
query, since lazy evaluation fuses the pieces into one plan.
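To make the "small, unit-testable pieces" point concrete, here is a plain-Python stand-in (not the actual Spark API) where each stage is a named function over rows. In real Spark these would be DataFrame-to-DataFrame functions, and lazy evaluation would fuse the composed stages into a single execution plan; the field names below are hypothetical.

```python
# Each stage is a small, independently unit-testable transformation.
# (In Spark: DataFrame -> DataFrame functions composed via .transform().)

def parse_amounts(rows):
    # Hypothetical step: cast the "amount" field from string to float.
    return [{**r, "amount": float(r["amount"])} for r in rows]

def drop_refunds(rows):
    # Hypothetical step: filter out negative amounts (refunds).
    return [r for r in rows if r["amount"] >= 0]

def total_by_user(rows):
    # Hypothetical step: aggregate total amount per user.
    totals = {}
    for r in rows:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["amount"]
    return totals

def pipeline(rows):
    # Compose the pieces; each one can be tested on a tiny in-memory
    # dataset, while the whole chain still runs end to end.
    return total_by_user(drop_refunds(parse_amounts(rows)))

rows = [
    {"user": "a", "amount": "10.0"},
    {"user": "a", "amount": "-3.0"},  # refund, dropped by drop_refunds
    {"user": "b", "amount": "5.5"},
]
print(pipeline(rows))  # -> {'a': 10.0, 'b': 5.5}
```

The equivalent logic buried in one large SQL script is harder to test piecewise, which is the debugging complaint below.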

Don't large SQL scripts become really gnarly for complex stuff - nothing
short of magical incantations? I can't inspect the data flowing out of a
subquery for debugging without changing the code.

Prophecy as a company is focused on making Spark significantly easier to use!

------
ibains
Would love to get perspectives from the HN community - did you decide between
Snowflake and Spark for data engineering? Which one did you pick, and why?

~~~
sails
Snowflake through and through.

I haven't used Spark, and have used Snowflake extensively, so I may be quite
biased. I imagine Spark has advantages in certain very high performance ML
workflows.

(see my HN profile for links to my writing on the topic of Snowflake)

I didn't really understand the blog post, to be honest, and I also don't
really know how appropriate a direct comparison between Snowflake and Spark
is. Perhaps you can give your perspective.

