
I wonder if anyone with experience with Spark can comment on / rebut this post: http://blog.explainmydata.com/2014/05/spark-should-be-better...



I use spark a lot and my experience has been quite the opposite. The queries I run against spark cover billions of events and results come back sub-second.

I could only speculate as to what this user's issues were. One difference between hadoop and spark is that spark is more sensitive to configuration: you sometimes need to tell it how many tasks to use. In practice it's no big deal at all.

Perhaps the user was running into this: the data for a task in spark is processed entirely in memory, whereas hadoop will load and spill to disk within a task. So if you give a single hadoop reducer 1TB of data, it will complete, just after a very long time. In spark you would need 1TB of memory on the executor to do the same. I wouldn't give an executor/JVM anything over 10GB, so if you have lots of memory, balance it across cores and executors.
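
To make the task-count and memory points concrete, here's a rough driver-side sketch. The paths and numbers are made up, and the right executor sizing depends on your cluster manager and hardware:

    import org.apache.spark.{SparkConf, SparkContext}

    // Keep each executor JVM modest instead of one giant heap.
    val conf = new SparkConf()
      .setAppName("partition-sizing-sketch")
      .set("spark.executor.memory", "8g")   // well under the ~10GB ceiling mentioned above
    val sc = new SparkContext(conf)

    val events = sc.textFile("hdfs:///data/events")   // hypothetical input

    // Shuffle operations take an explicit task count; raising it keeps the
    // per-task slice of data small enough to process comfortably in memory.
    val countsByKey = events
      .map(line => (line.split("\t")(0), 1L))
      .reduceByKey(_ + _, 2000)   // 2000 reduce tasks instead of the default

    countsByKey.saveAsTextFile("hdfs:///out/counts-by-key")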

I have seen spark use up all the inodes on systems before. A job with 1000 map and 1000 reduce tasks would create 1M spill files on disk. However that was on an earlier version of spark and I was using ext3. I think this has since been improved.

For me spark runs circles around hadoop.


>The queries I run against spark are billions of events and results are sub-second.

This is interesting, I haven't gotten Spark to do anything at all in less than a second. How big is this dataset (what does each event consist of)? How is the data stored? How many machines / cores are you running across? What sort of queries are you running?

>I could only speculate as to what this users issues were.

I'm the author of the above post and unfortunately I can also "only speculate" about what my issues were. Maybe Spark doesn't like 100x growth in the size of an RDD using flatMap? Maybe large-scale joins don't work well? Who knows. The problem, however, definitely doesn't seem to be anything covered by the tuning guide(s).


> How big is this dataset (what does each event consist of)?

Standard clickstream data, maybe 50-ish parameters per event.

> What sort of queries are you running?
> How is the data stored?

Depends on the use case. For sub-second adhoc queries we go against bitmap indexes. For other queries we use RDD.cache() after a group/cogroup and answer queries directly from that. Other queries hit ORC files. Spark is very memory sensitive compared to hadoop, so using a columnar store and only pulling out the data that you absolutely need goes a very long way. Minimizing cross-communication and shuffling is key to achieving sub-second results. It's impossible to achieve that if you're waiting for TB of data to shuffle around =)
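
As a rough illustration of the cache-after-cogroup pattern (not their actual setup; the paths and schemas here are invented), in spark-shell style where `sc` is the provided SparkContext:

    // (userId, pageUrl) and (userId, purchaseAmount) are made-up schemas.
    val clicks = sc.textFile("hdfs:///data/clicks")
      .map(_.split("\t")).map(f => (f(0), f(1)))
    val purchases = sc.textFile("hdfs:///data/purchases")
      .map(_.split("\t")).map(f => (f(0), f(2).toDouble))

    // Pay the shuffle cost once, then keep the cogrouped result in memory.
    val byUser = clicks.cogroup(purchases).cache()

    // Subsequent queries read from the cached RDD and avoid re-shuffling.
    val buyersWithClicks = byUser
      .filter { case (_, (clickUrls, amounts)) => clickUrls.nonEmpty && amounts.nonEmpty }
      .count()
    val totalRevenue = byUser
      .flatMap { case (_, (_, amounts)) => amounts }
      .sum()

    println(s"buyers with clicks: $buyersWithClicks, revenue: $totalRevenue")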

> How many machines / cores are running across?

Depends on the use case. Clusters are 10-30 machines; some run virtualized on OpenStack. We will grow our 30-node cluster in 6 months.

> Maybe Spark doesn't like 100x growth in the size of an RDD using flatMap

You may actually just need to scale the number of partitions for that particular stage in proportion to the growth. Also, when possible use mapPartitions; it is very memory efficient compared to map/flatMap.
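
A sketch of both suggestions, assuming a hypothetical expandEvent function that turns one record into ~100 derived records, and `sc` from spark-shell:

    import org.apache.spark.rdd.RDD

    // Hypothetical expansion: one input record becomes ~100 derived records.
    def expandEvent(line: String): Iterator[String] =
      Iterator.tabulate(100)(i => s"$line#$i")

    val events: RDD[String] = sc.textFile("hdfs:///data/events")

    // Scale the partition count in line with the ~100x growth so each task's
    // expanded slice stays small.
    val expanded = events
      .repartition(events.partitions.length * 10)
      .flatMap(expandEvent)

    // Same expansion via mapPartitions: one call per partition, so any
    // per-partition setup (parsers, buffers, ...) is paid once and the output
    // is produced by streaming over the partition's iterator.
    val expandedViaPartitions = events
      .repartition(events.partitions.length * 10)
      .mapPartitions(iter => iter.flatMap(expandEvent))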

> Maybe large-scale joins don't work well

Keep in mind that whatever happens within a task happens entirely in memory. For large joins I created a "bloom join" implementation (not currently open source =( ) that handles this efficiently. It takes two passes at the data, but minimizes what gets shuffled.
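
The general pre-filter-then-join idea looks roughly like this (not their implementation; a broadcast Set stands in for the bloom filter, the paths and schemas are made up, and `sc` is the spark-shell SparkContext):

    val small = sc.textFile("hdfs:///data/dim")
      .map(_.split("\t")).map(f => (f(0), f(1)))          // (key, dimension value)
    val large = sc.textFile("hdfs:///data/facts")
      .map(_.split("\t")).map(f => (f(0), f(2)))          // (key, fact value)

    // Pass 1: build a compact membership structure from the small side's keys.
    val keys = sc.broadcast(small.keys.collect().toSet)

    // Pass 2: drop rows that cannot possibly match before paying the shuffle,
    // then do the normal join on what's left.
    val joined = large
      .filter { case (k, _) => keys.value.contains(k) }
      .join(small)

A real bloom filter keeps the broadcast small at the cost of some false positives, which the join itself then discards, so the result is still exact.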


> For sub-second adhoc queries we go against bitmap indexes.

Did you implement the index yourself? How many events survive the initial filtering?


> Maybe Spark doesn't like 100x growth in the size of an RDD using flatMap?

I'd be interested to hear more about your use case and the problems you encountered. It's possible that you need to do some kind of .coalesce() operation to rebalance the partitions if you have unbalanced partition sizes.
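
One quick way to check for skew and rebalance, as a sketch against a hypothetical RDD (`sc` is the spark-shell SparkContext):

    val expanded = sc.textFile("hdfs:///data/events").flatMap(_.split(" "))

    // Per-partition element counts: wildly uneven numbers indicate skew.
    println(expanded.mapPartitions(it => Iterator(it.size)).collect().mkString(", "))

    // coalesce with shuffle = true (same as repartition) redistributes records
    // evenly; without the shuffle it can only merge existing partitions.
    val rebalanced = expanded.coalesce(512, shuffle = true)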


Well, the RDD is initially partitioned using a RangePartitioner over a dense key space of Longs. Each element is then expanded ~100x (each object is significantly smaller than the original value). So the total memory footprint and skew of the expanded RDD shouldn't, theoretically, be a problem.


My experience using spark has also been nothing but positive. I recently built a similarity-based recommendation engine on Spark (https://github.com/evancasey/sparkler) and found it to be significantly faster than comparable implementations on Hadoop.

subprotocol's point about specifying the number of tasks/data partitions is true - you need to set this manually to get good results, even on a small dataset. Other than that, though, spark will give you good results pretty much out of the box. More advanced features such as broadcast objects, cache operations, and custom serializers will further optimize your application, but they are not as critical when first starting out as the author seems to believe.
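
For reference, a driver-side sketch of those knobs; the values are placeholders rather than recommendations, and the lookup table and paths are made up:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("tuning-sketch")
      .set("spark.default.parallelism", "200")   // default task count for shuffles
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // custom serializer
    val sc = new SparkContext(conf)

    // Broadcast a read-only lookup table once instead of shipping it inside every task closure.
    val countryNames = sc.broadcast(Map("us" -> "United States", "de" -> "Germany"))

    val events = sc.textFile("hdfs:///data/events")
      .map(_.split("\t")(3))
      .map(code => countryNames.value.getOrElse(code, "unknown"))
      .cache()   // keep the hot dataset in memory for repeated queries

    println(events.count())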


I'm really curious to find out in what situations Spark actually works for people. So far, no one in my lab seems to be having a terribly productive time using it. Maybe it's better for simple numerical computations? How large are the datasets you're working with?


I did most of my benchmarking with the 10M MovieLens Dataset http://grouplens.org/datasets/movielens/ consisting of 10 million movie ratings on 10,000 movies from 72,000 users. So not necessarily "big data", but big enough to warrant a distributed approach.

Spark is ideally suited for iterative, multi-stage jobs. In theory, anything that requires doing multiple operations on a working dataset (i.e. graph processing, recommender systems, gradient descent) will do well on Spark due to the in-memory data caching model. This post explains some of the applications Spark is well-suited for: http://www.quora.com/Apache-Spark/What-are-use-cases-for-spa...
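
A toy sketch of why the caching model helps: parse the ratings once, cache, and every subsequent pass reads from memory instead of HDFS. This assumes the "::"-delimited ratings.dat layout of the 10M release, a placeholder path, and `sc` from spark-shell:

    val ratings = sc.textFile("hdfs:///data/ml-10M/ratings.dat")
      .map(_.split("::"))
      .map(f => (f(1), f(2).toDouble))   // (movieId, rating)
      .cache()                           // parse once, reuse on every pass

    // Each pass hits the in-memory copy instead of re-reading and re-parsing HDFS.
    for (threshold <- Seq(3.0, 3.5, 4.0, 4.5)) {
      val n = ratings.filter { case (_, rating) => rating >= threshold }.count()
      println(s"ratings >= $threshold: $n")
    }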


So the central piece of data is something like a 10 million element RDD of (UserId, (MovieId, Rating))? If so, it sounds like that data would fit into a single in-memory sparse array, how does Spark's performance compare with a local implementation?

By comparison, I'm trying (and failing) to work with RDDs of 100+ billion elements.


What is the difference between Spark and Storm? They both seem like "realtime compute engines"

*edit - from what I can see Spark is a replacement for hadoop (offline jobs), whereas Storm deals with online stream processing


Storm is generally more of a per-event, real/near-real-time dataflow system (each event flows through N spouts and bolts), whereas Spark is more of an in-memory data processing system, with Spark Streaming being the rough "equivalent" of Storm.
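
A minimal Spark Streaming sketch of the micro-batch model, with a placeholder socket source: incoming lines are grouped into 2-second batches and each batch is processed as a small RDD, rather than one event at a time:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("streaming-sketch")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Every 2 seconds the accumulated lines become a small RDD ("micro-batch")
    // and the pipeline below runs on that batch.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()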


Spark's abstractions are indeed really nice; a Spark job is much more readable than the same thing expressed in raw MapReduce, as that post acknowledges at the end.

I can't really comment on or rebut "my code runs slow and I don't know why", except to say that Spark performance has been great when I've used it. But yeah, if the abstraction does fail (and again, all I can say is it hasn't for me), I can imagine it's not much fun to debug performance, and there's no distributed profiler (though I think you'd be in much the same boat with vanilla Hadoop).


> that Spark performance has been great when I've used it.

Can you say more about your use case? What sort of data did you start with? What did you do with it? How large was the cluster you were running on?


Not sure how much I should say. Advertising analytics. Fairly small cluster (<100). More for ad-hoc theory testing rather than anything regular.



