
Introducing Spark Datasets - jonbaer
https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
======
srean
I would be quite interested to see how Dask, Spark and Dato's (or GraphLab's)
SFrames fare relative to each other. My hunch is that none dominates the
others totally in performance and that each has its sweet spot. Are there
any well-executed, good-faith benchmarks available?

------
rkrzr
Does somebody know why these improvements could not be integrated into
DataFrames directly?

Why add another very similar layer of abstraction?

They mention:

"Unification of DataFrames with Datasets – due to compatibility guarantees,
DataFrames and Datasets currently cannot share a common parent class. With
Spark 2.0, we will be able to unify these abstractions with minor changes to
the API, making it easy to build libraries that work with both."

Anybody know what those "compatibility guarantees" are?

~~~
nl
DataFrames aren't strongly typed (while RDDs are).

It's difficult to add strong typing to an existing API.

I certainly prefer using DataFrames generally, but sometimes you need to
switch to RDDs because some operations are easier to express there.

Hopefully Datasets bridge the gap well. I haven't used them yet, but the
examples look nice.
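
To make the typing point concrete, here is a rough sketch in plain Scala, with
no Spark involved. `Row`, `Person`, `toPerson` and the sample data are made-up
stand-ins for illustration, not Spark's API. The idea is only that untyped rows
defer column-name and type checks to runtime, while a case class lets the
compiler check field access.

```scala
// Plain Scala, no Spark: a toy model of untyped rows versus typed records.
object TypedVsUntyped {
  // Hypothetical stand-in for an untyped row: column names map to values of any type.
  type Row = Map[String, Any]

  // Hypothetical typed record, similar in spirit to the case classes used with Datasets.
  final case class Person(name: String, age: Int)

  // Made-up sample data.
  val rows: Seq[Row] = Seq(
    Map("name" -> "Ann", "age" -> 34),
    Map("name" -> "Bob", "age" -> 19)
  )

  // Untyped access: the column names and the casts are only checked at runtime.
  def adultsUntyped(rs: Seq[Row]): Seq[String] =
    rs.filter(r => r("age").asInstanceOf[Int] >= 21)
      .map(r => r("name").asInstanceOf[String])

  // Convert once at the boundary; a missing or mistyped column yields None.
  def toPerson(r: Row): Option[Person] =
    (r.get("name"), r.get("age")) match {
      case (Some(n: String), Some(a: Int)) => Some(Person(n, a))
      case _                               => None
    }

  // Typed access: misspelling `age` or `name` here would be a compile error.
  def adultsTyped(ps: Seq[Person]): Seq[String] =
    ps.filter(_.age >= 21).map(_.name)

  def main(args: Array[String]): Unit = {
    println(adultsUntyped(rows))
    println(adultsTyped(rows.flatMap(toPerson)))
  }
}
```

The conversion in `toPerson` is roughly the boundary that typed Datasets are
meant to handle for you, though Spark does it with encoders rather than
anything like this toy version.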

------
necrobrit
Really looking forward to this dropping! Writing complex data out of Spark
jobs is super easy, but getting it back into strongly typed Scala has been a
big pain point.

The experimental release is a bit broken though; I ran into a pretty big bug:
[https://issues.apache.org/jira/browse/SPARK-12714](https://issues.apache.org/jira/browse/SPARK-12714)

