

Announcing Spark 1.3 - rxin
https://databricks.com/blog/2015/03/13/announcing-spark-1-3.html

======
eranation
This is a really great release; I'm excited to start playing with DataFrames!
One question, if anyone from Databricks reads this: what about GraphX? Will it
also get the same level of attention that SQL, MLlib, and Spark core got
recently? E.g., is adding support for Gremlin (or any other graph query
language) on the roadmap? What about an R API for GraphX? Is that planned?
P.S. When is GraphX planned to exit Alpha?

In any case, it's a great product, with nice use of Akka and Scala and a very
intuitive API. I feel lucky to be working with it on a daily basis.

~~~
rxin
GraphX actually graduated from Alpha in Spark 1.2.

We have a few important improvements and changes to GraphX planned for 1.4,
including a Java API and possibly a Python API.

------
Wonnk13
I guess the DataFrame API needs to be spelled out for me. Does this mean RDDs
will be deprecated in the future if the new DataFrame API is faster? As someone
whose gateway drug into programming was R, it's been fun to watch data frames
grow across programming languages. I'm a huge fan of Python's Pandas library
and very interested in Spark.

~~~
pwendell
The DataFrame is an evolution of the RDD model, where Spark knows explicit
schema information. The core Spark RDD API is very generic and assumes nothing
about the structure of the user's data. This is powerful, but ultimately the
generic nature imposes limits on how much we can optimize.

DataFrames impose just a bit more structure: we assume that you have a tabular
schema, named fields with types, etc. Given this assumption, Spark can
optimize a lot of internal execution details, and also provide slicker APIs
to users. It turns out that a huge fraction of Spark workloads fall into this
model, especially since we support complex types and nested structures.
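
To make that concrete, here's a rough sketch of the 1.3 Scala API (assuming an
existing SparkContext sc and a hypothetical users.json with name/age/country
fields):

    import org.apache.spark.sql.SQLContext

    // build a DataFrame with an inferred schema, then query it
    val sqlContext = new SQLContext(sc)
    val users = sqlContext.jsonFile("users.json")  // column names and types inferred

    users.printSchema()
    users.filter(users("age") > 21)
         .groupBy("country")
         .count()
         .show()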

Is the core RDD API going anywhere? Nope - not any time soon. Sometimes it
really is necessary to drop into that lower level API. But I do anticipate
that within a year or two most Spark applications will let DataFrames do the
heavy lifting.

In fact, DataFrames and RDDs are completely interoperable; either can be
converted to the other. This means that even if you don't want to use
DataFrames, you can benefit from all of the cool input/output capabilities they
have, even just to create regular old RDDs.
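
A rough sketch of the round trip (same sqlContext as above, with a hypothetical
User case class):

    import sqlContext.implicits._   // enables rdd.toDF() in 1.3

    case class User(name: String, age: Int)

    // RDD of case classes -> DataFrame
    val userRdd = sc.parallelize(Seq(User("alice", 30), User("bob", 25)))
    val usersDf = userRdd.toDF()

    // DataFrame -> RDD of Rows
    val rows = usersDf.rdd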

~~~
EdwardDiego
> It turns out that a huge fraction of Spark workloads fall into this model,
> especially since we support complex types and nested structures.

The first step of all my Spark tasks is "turn this RDD[String] into an RDD of
parsed JSON", or turning CSV into case classes.

What JSON parser will DataFrames be using? I presume Jackson?

------
thethimble
Out of curiosity, is anyone using Spark in production? We're evaluating
whether we should invest in Hadoop or Spark. They're certainly not mutually
exclusive, but I would rather invest fully in Spark than have infrastructure
split between Spark and Hadoop.

~~~
gandalfu
I believe one of the key advantages of Spark over Hadoop is being able to run
the full stack in a small environment (a single machine) and do all the coding
there without needing a cluster just for development.
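
For example, the same job you would submit to a cluster can run on a laptop
just by pointing the master at local[*] (a sketch, assuming a hypothetical
sample.txt):

    import org.apache.spark.{SparkConf, SparkContext}

    // run on all local cores; no cluster required
    val conf = new SparkConf().setAppName("local-dev").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("sample.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()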

~~~
virmundi
This is why I like Cascading [1]. It has a higher-level API on top of Hadoop,
and it also works in local mode with little change. I've actually used it to do
transformation work on local files (CSV), join them into structured documents,
and dump them into ArangoDB. I liked it so much I wrote a third-party library
for working with ArangoDB in Hadoop [2].

[1] cascading.org/
[2] https://github.com/deusdat/guacaphant

------
djcater
Thanks for the hard work. When can we get the long-awaited ORC file support,
including predicate pushdown?

------
ampermad
Our experience with Spark has been horrendous. Very unstable. Marginal
improvements. Big hassle.

Would strongly advise you to consider Hadoop. We also used Storm and found it
to be much more stable.

Databricks makes a lot of noise though.

~~~
kiyoto
Can you share the specifics of your experience (cluster size, use case,
etc.)?

I am partly asking this because you clearly feel strongly about this topic
(your account was created an hour ago, most likely to comment on this).

