Hacker News new | past | comments | ask | show | jobs | submit login
Accidentally exponential behavior in Spark (heap.io)
75 points by drob on June 21, 2021 | hide | past | favorite | 33 comments

There are two lessons you could learn from this episode:

1. Use shallow trees and the clever workaround presented in the article.

2. Don't use Spark for tasks that require complex logic.

People should trace out the line of reasoning that leads them to use tools like Spark. It is convoluted and contingent - it goes back to work done at Google in the early 2000s, when the key to getting good price / performance was using a large number of commodity machines. Because they were cheap, these machines would break often, so you needed some really smart fault tolerance technology like Hadoop/HDFS, which was followed by Spark.

The current era is completely different. Now the key to good price / performance is to light up machines on-demand and then shut them down, only paying for what you use - and perhaps using the spot market. You don't need to worry about storage - that's taken care of by the cloud provider, and you can't "bring the computation to the data" like in the old days, removing one of the big advantages of Hadoop/HDFS. Because they are doing mostly IO and networking, and because computers are just more resilient nowadays, jobs rarely fail because of hardware errors. So almost the entire rationale that led to Hadoop/HDFS/Spark is gone. But people still use Spark - and put up with "accidentally exponential behavior" - because the tech industry is do dominated by groupthink and marketing dollars.

Exactly correct. I’ve got a post in the works called “Elegy for Hadoop” that traces the history back to the early 2000s and arrives at the present day where you can easily get on-demand instances with 500Gb of RAM and use it for only your application’s lifetime. If you want 1000Gb instead of 500gb it does not cost 5x it costs 2x, significantly invalidating the “need to use excess commodity hardware” premise of the distributed map reduce architecture.

Edit: I don’t mean to suggest that there is no reason to use Spark, but ~95% of the usage in industry is unnecessary now and should be avoided.

Is there anything you can say about Spark for Data Engineering (/ETL) ?

The most common reason for spark use today is ETL+DataLakes (ie., cloud object stores and ETL in/out).

It seems actual analysis is happening in fast databases that receive data from the object stores.

can anyone here comment on this paradigm?

I don't have much insight into spark but I've been using Dataflow/beam for ETL. Been a pretty good experience. follows the style of spinning up compute to process as needed then shutdown.

i predicted 99.99%

Is there an alternative you’d recommend?

Check out Frank McSherry’s COST (Configuration that Outperforms a Single Thread) and see if you are just better off with a single fat machine[1].

1. https://www.usenix.org/system/files/conference/hotos15/hotos...

Premise of the article is very true, but the comparison itself is very biased and dishonest.

Graph problems are famously hard to scale horizontally, and represent very small percent of what people use those big data systems for. Especially if you can fit the data in RAM...

Anyway, if you're able to run your workload on a single machine, then definitely do it.

I basically agree with you. Linked COST because of the premise and the upshot of the paper, which is totally valid.

Spark is still the best for stream processing use cases and if you have enough volume of data coming in something like spark is still the best for batch processing. '

>Spark is still the best for stream processing use cases

No, Flink is much better.

I've hit almost the exact same issue with Hive, with a somewhat temporary workaround (like this post) to build a balanced tree out of this by reading it into a list [1] and rebuilding a binary balanced tree out of it.

But we ended up implementing a single level Multi-AND [2] so that this no longer a tree for just AND expressions & can be vectorized neater than the nested structure with a function call for each (this looks more like a tail-call rather than a recursive function).

The ORC CNF conversion has a similar massively exponential item inside which is protected by a check for 256 items or less[3].

[1] - https://github.com/t3rmin4t0r/captain-hook/blob/master/src/m...

[2] - https://issues.apache.org/jira/browse/HIVE-11398

[3] - https://github.com/apache/hive/blob/master/storage-api/src/j...

It seems to me that the basic problem is that binary trees are the wrong data type. For instance, you can transform the tree to balance it:

    p1 AND (p2 AND (p3 AND notp4)) -> (p1 AND p2) AND (p3 AND notp4)
But the abstraction is specifying the order of operations unnecessarily. Using general trees, I think you avoid the need to transform the order of operations and "NOT" doesn't have to be a special case.

    ALL (p1 p2 p3 NOT(p4))
Is there any reason to choose binary trees for this? (Other than inertia).

> the abstraction is specifying the order of operations unnecessarily

That needs an "in SQL", the standard imperative language operator ordering has short-cut operations in there (a is null or a.value == true) etc.

In the code I work with, this actually sorts the conditions based on estimated selectivity[1] and type (long compares to constant are cheaper on a columnar data-set due to the SIMD, but string isn't etc).

> Is there any reason to choose binary trees for this?

The parse-tree does come off as binary because inserting logical parentheses makes it easier to tackle, because there are association rules which neatly go into a BinaryOp structure when dealing with operator precedence in parsing.

So it is easier to handle the parsing when you treat (a + (b + c) ) and (a / (b / c)) in similar fashion.

I won't make the same mistake again if I have to build a SQL engine, but this actually made the logical expression match the parse tree very closely and was a good enough generalization until the traversal time bugs[2] started to pop up.

[1] - https://github.com/apache/hive/blob/master/common/src/java/o... [2] - https://issues.apache.org/jira/browse/HIVE-9166

Why not just...

  val transformedLeftTemp = transform(tree.left)

  val transformedLeft = if (transformedLeftTemp.isDefined) {
  } else None

Good question, the simplified example doesn't make this clear.

The real implementation has a mutable `builder` argument used to gradually build the converted filter. If we perform the `transform().isDefined` call directly on the "main" builder, but the subtree turns out to not be convertible, we can mess up the state of the builder.

The second example from the post would look roughly like this:

  val transformedLeft = if (transform(tree.left, new Builder()).isDefined) {
    transform(tree.left, mainBuilder)
  } else None

Since the two `transform` invocations are different, we can't cache the result this way.

There's a more detailed explanation in the old comment to the method: https://github.com/apache/spark/pull/24068/files#diff-5de773... .

It looks like transform(tree.left) returns an Option[Tree] already (otherwise the code would not type check) so the entire if-else in the original code seems redundant and could be replaced with:

    val transformedLeft = transform(tree.left)

Just responded to the parent comment as well - there's an additional mutable argument to the real `transform` method so it's unsafe to invoke it directly without first checking if the tree is convertible.

It boggles my mind that the author wrote an entire long article based on this.

The rhetorical question saying that surely that weird refactor of two different functions into one, followed by calling that new, non-trivial function twice for no reason surely shouldn't affect performance.. He already lost me during the premise of the article.

What is so hard to understand here? There is some library code you can't immediately change because it belongs to upstream Spark. To illustrate the problem, ne simplifies the code to represent what the problem is.

Then, ne writes some code that works around the library bug by modifying the input losslessly into something that's more easily processed by the library.

Finally, ne patches the library bug and shares the patch.

All of this is also kinda fucking obvious to not just me, but a lot of people, so I'm having a really hard time grasping if you've mixed up the illustrative simplification with the actual code, or if you think that the best engineering approach is to always patch your environment bugs instead of modifying your input, or if you just don't have a Github account or for some other reason can't read the patch.

Between that patch and https://github.com/apache/spark/pull/24910 you can see why the code is what it is.

Post author here. Let me know if you have any questions!

Is there anything you can say here about why you're running this query in spark?

Supposing spark is your ETL machinery... would it not make more sense to ETL this into a database?

Definitely. One of the primary benefits we get out of Spark is the ability to decouple storage and compute, and to very easily scale out the compute.

Our main Spark workload is pretty spiky. We have low load during most of the day, and very high load at certain times - either system-wide, or because a large customer triggered an expensive operation. Using Spark as our distributed query engine allows us to quickly spin up new worker nodes and process the high load in a timely manner. We can then downsize the cluster again to keep our compute spend in check.

And just to provide some context on our data size, here's an article about how we use Citus at Heap - https://www.citusdata.com/customers/heap . We store close to a petabyte of data in our distributed Citus cluster. However, we've found Spark to be significantly better at queries with large result sets - our Connect product syncs a lot of data from our internal storage to customers' warehouses.

Would be nice if title said Apache Spark instead of just Spark, since there are other programs like Spark/Ada also called Spark.

There is also Java web application framework called Spark. Nowadays everyone just call it Sparkjava.

Good read - fwiw if this is your blog some of your links are broken and think they are local - https://heap.io/blog/%E2%80%9Dhttps://github.com/apache/spar...

Fixed, thank you for flagging!

Spark is this weird ecosystem of people who take absolutely trivial concepts in SQL, bury their heads in the sand and ignore the past 50 years of RDBMS evolution, and then write extremely complicated (or broken) and expensive to run code. But whatever it takes to get Databricks to IPO! Afterwards the hype will die down and everyone will collectively abandon it just like MongoDB except for the unfortunate companies with so much technical debt they can't extricate themselves from it.

There's certainly some of that and I have experienced project managers asking me to put 5GB datasets in spark... but there's definitely a set of problems where vertical scaling is a PITA and MPP basically generally breaks the SQL guarantees anyway, costs a milli, requires rewrites, etc.

When you want to process N+1 TB/PB its hard to throw standard relational approaches at it imo.

SQL is strings all the way down, testing the database itself is often shitshow...

While I agree that it can easily be "strings all the way down", as often the way folks make spark testable is only slightly more advanced than using views in a sql world. Add in an understanding of windowing functions, and some trivial assertions on expected query results go a long way.

spark is far more testable and composable than sql! and you even get static typing checking. plus i can read data from anywhere - local fs, s3, rdbms, json, parquet, csv... rdbms could not compete

Many (most?) DBs have no problem ingesting json, parquet, csv etc from S3. Some can query those formats without first ingesting them.

Is it best to just use spark.sql?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact