

RDDs are the new bytecode of Apache Spark - ssaboum
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

======
pacala
The API is great and very useful for Python coders. The Scala version, however,
uses strings to identify fields. I haven't run the example code, but it seems to
essentially trade compile-time type checks for run-time ones, which is an odd
API choice for Scala.
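
For comparison, here is a minimal sketch of the two styles (assuming an existing
SparkContext named sc and a hypothetical people.json input file):

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Long)

    val sqlContext = new SQLContext(sc)

    // RDD style: field access goes through the case class,
    // so a misspelled field name fails to compile.
    val peopleRdd = sc.parallelize(Seq(Person("Alice", 34), Person("Bob", 29)))
    val namesRdd = peopleRdd.map(_.name)

    // DataFrame style: fields are referenced by string,
    // so a misspelled field name only fails when the query runs.
    val peopleDf = sqlContext.read.json("people.json")
    val namesDf = peopleDf.select("name")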

~~~
rxin
It is actually not possible to enforce type checks at compile time, due to the
dynamic nature of data (e.g. you can generate a DataFrame from JSON files
whose schemas are automatically inferred by Spark, or generate a DataFrame by
loading a table in Hive). There is simply not enough type information
available at compile time, unless we rule out all these cool use cases.
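
For example (a rough sketch, assuming an existing SparkContext sc and a
placeholder events.json path), the schema only exists once Spark has scanned
the data:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // The schema is inferred by scanning the JSON records at run time,
    // so the compiler has no way to know which fields exist.
    val events = sqlContext.read.json("events.json")
    events.printSchema()
    events.select("user").show()  // only valid if the inferred schema has a "user" field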

There's one past attempt at making it more type-safe using macros, but there
are a lot of caveats to using it in practice.
[https://github.com/marmbrus/sql-typed](https://github.com/marmbrus/sql-typed)

~~~
pacala
It's software, everything is possible :) It is fairly straightforward to add
compile-time information via a simple case class definition. There is still a
dynamic check that the JSON matches the case class, but the spec of the data's
structure lives in one place instead of being spread arbitrarily over the query
code base.
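
Roughly something like this sketch (again assuming an existing SparkContext sc
and a hypothetical people.json file), where the case class is the single spec
and the one dynamic check happens at the boundary:

    import org.apache.spark.sql.SQLContext

    // The structure of the data is declared once, here.
    case class Person(name: String, age: Long)

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.json("people.json")

    // One dynamic check at the boundary: rows either match Person or fail here,
    // instead of field-name typos surfacing deep inside later queries.
    val people = df.map(row => Person(row.getAs[String]("name"), row.getAs[Long]("age")))

    // From here on, the compiler checks field access.
    val adultNames = people.filter(_.age >= 18).map(_.name)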

~~~
rxin
That part is actually coming:
[https://github.com/apache/spark/pull/5713](https://github.com/apache/spark/pull/5713)

