

Announcing Apache Spark 1.1 - rxin
http://databricks.com/blog/2014/09/11/announcing-spark-1-1.html

======
pwendell
Hey all - I'm the release manager for Spark 1.1. Happy to answer any questions
about Spark or this release.

~~~
hcrisp
Good news about the PySpark input format improvements. Does that also cover
reading complex Parquet types into SchemaRDDs as their native datatypes?
When can we get a Databricks Cloud account (I'm already on the waiting list)?

~~~
ambrood
Don't SchemaRDDs already support Parquet? It'd be great if they supported
CSVs too, though.
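
Something like this already works for Parquet, if I remember right (a minimal
sketch against the Spark 1.1 Scala API; the file path is made up and `sc` is
an existing SparkContext):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Reads the Parquet file and infers its schema; the result is a SchemaRDD.
    val people = sqlContext.parquetFile("people.parquet")
    people.registerTempTable("people")

    // Query it with SQL and pull the results back to the driver.
    sqlContext.sql("SELECT name FROM people").collect().foreach(println)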

~~~
JoshRosen
There's work in progress to support importing CSV data as SchemaRDDs:

[https://issues.apache.org/jira/browse/SPARK-2360](https://issues.apache.org/jira/browse/SPARK-2360)
[https://github.com/apache/spark/pull/1351](https://github.com/apache/spark/pull/1351)

------
chollida1
Great to see this:

> This release adds significant internal changes to Spark focused on improving
> performance for large scale workloads.

We looked at Spark Streaming briefly when choosing which CEP engine to use. We
ended up not using it as its performance wasn't on par with other offerings.

I hope the performance improvements they've made carry over to the Spark
Streaming product.
[http://spark.apache.org/streaming/](http://spark.apache.org/streaming/)

We ended up using Esper and a proprietary engine. The biggest problems with
some of the other streaming utilities are:

1) Performance.

2) You can't step time when doing backtesting. I.e., when replaying a day of
trading you'll often have signals that say something like: be passive for the
next 10 seconds, then go to the bid for 10 seconds, then the midpoint, and if
you still aren't done, cross the spread after an additional 10 seconds.

When backtesting you obviously don't play back in real time, or it would take
6.5 hours to re-simulate the day. You play back as fast as you can, so you
need your CEP system to step time as it goes so that it properly fires your
time-based triggers.
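
To make that concrete, here's a toy sketch of the idea (not any particular
engine's API; all names and timestamps are made up). Instead of sleeping, the
clock jumps straight to the next scheduled timer, so time-based triggers still
fire in order at full replay speed:

    import scala.collection.mutable

    object BacktestClock {
      case class Timer(fireAt: Long, action: Long => Unit)

      // Max-heap ordered by negated fire time, so the earliest timer is on top.
      private val timers =
        mutable.PriorityQueue.empty[Timer](Ordering.by((t: Timer) => -t.fireAt))
      var now: Long = 0L // virtual time in ms

      def schedule(delayMs: Long)(action: Long => Unit): Unit =
        timers.enqueue(Timer(now + delayMs, action))

      // Advance virtual time to `target`, firing every timer due on the way.
      def advanceTo(target: Long): Unit = {
        while (timers.nonEmpty && timers.head.fireAt <= target) {
          val t = timers.dequeue()
          now = t.fireAt // step, don't sleep
          t.action(now)
        }
        now = target
      }
    }

    object Demo extends App {
      import BacktestClock._
      // "Be passive for 10s, then go to the bid for 10s, then cross."
      schedule(10000) { t =>
        println(s"$t ms: go to the bid")
        schedule(10000) { t2 => println(s"$t2 ms: cross the spread") }
      }
      // Replaying a tick stamped 25s into the day fires both timers instantly.
      advanceTo(25000)
    }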

I'm checking out MemSQL right now to see how it lets you step time with
queries.

Startup idea... It would be nice to have one unified way of doing real-time
and batch processing of data. That way your real-time trading engine can be
backtested in the same way it runs in production. I think/hope Spark is on the
way to solving this.
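
From what I can tell, the Spark way of unifying the two is to write your logic
once as plain RDD-to-RDD functions and reuse them in streaming via transform.
A rough sketch against Spark 1.1 (the feed host, port, and file name are made
up):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD ops like reduceByKey

    object Unified {
      // The "strategy" logic, written once against RDDs.
      def signals(ticks: RDD[String]): RDD[(String, Int)] =
        ticks.map(line => (line.split(",")(0), 1)).reduceByKey(_ + _)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("unified"))

        // Backtest: run the logic over a day of recorded ticks as a batch job.
        signals(sc.textFile("ticks-2014-09-11.csv")).collect().foreach(println)

        // Production: apply the exact same function to each streaming batch.
        val ssc = new StreamingContext(sc, Seconds(1))
        ssc.socketTextStream("feedhost", 9999).transform(signals _).print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

That still doesn't give you the time stepping, though; Spark Streaming's
batches are driven by the wall clock.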

If anyone has any input on the best way to unify streaming vs batch event
processing please let me know!!!

~~~
jandrewrogers
It is quite possible to build a unified storage and execution kernel that will
allow you to simultaneously and seamlessly blend (1) streaming ingest path
processing, (2) online indexing/storage to disk at wire speed, and (3) fast
online query processing that immediately reflects both storage and ingest
path. Saturating a 10 GbE connection with this kind of workload on an ordinary
server is pretty simple if the system is designed correctly. There is nothing
technical that prevents processing 10 GbE streaming data concurrent with
storing that data to disk _and_ running queries across all of that data fast
enough to saturate the outbound network. I have designed and implemented
kernels that do exactly this.
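
To illustrate just the shape of it (a toy, in-memory, single-machine sketch;
nothing like the wire-speed kernels described above): one thread ingests into
an online ordered index while a concurrent reader's queries immediately see
the newly written data, with no batch boundary in between.

    import java.util.concurrent.ConcurrentSkipListMap

    object IngestQuerySketch extends App {
      // timestamp -> record: a concurrent ordered index, readable mid-ingest.
      val index = new ConcurrentSkipListMap[Long, String]()

      val ingest = new Thread(() => {
        var t = 0L
        while (t < 1000000L) { index.put(t, s"record-$t"); t += 1 }
      })

      val query = new Thread(() => {
        for (_ <- 1 to 5) {
          // Every read sees whatever has been ingested so far.
          println(s"rows visible to queries: ${index.size}")
          Thread.sleep(5)
        }
      })

      ingest.start(); query.start()
      ingest.join(); query.join()
    }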

When everything runs at wire speed and all data is always fully online, there
is no meaningful distinction between "batch" and "streaming".

That said, the reason you generally do not see anything in open source that
does it is that it requires a more sophisticated internal design and
architecture than you typically see in any of the open source databases or big
data platforms. Even a system designed to only run on a single machine (as
opposed to a parallel cluster) is probably 50kLoC of dense, complex, low-level
C++ just to get a basic kernel off the ground.

The main reasons you do not see a lot of startups doing this: The design of
these types of storage/execution engines is rare knowledge. You have to design
and implement most of your own algorithms and data structures -- few things
can be farmed out to template libraries or system calls. Relatively few
developers are sufficiently skilled in C++ to successfully implement these
kinds of systems with these kinds of performance envelopes. The code base for
an MVP, excluding the expansive test infrastructure, is pretty huge, so you
need a lot of man-hours with the above skills and expertise.

The high initial cost of building these types of systems and the difficulty of
finding the necessary talent make them unattractive to investors as startups.
You often have to spend $10-20M before you are even going to know if there is
traction. That is a pricy bet.

~~~
lsh123
Everything above is a great summary of the CEP startup I was part of 7-8 years
ago. It took us 3+ years to get the product off the ground to the stage where
we started to see real sales. We had to build custom code to handle
networking, messaging, scheduling, storage, etc. to get to the required
performance numbers.

------
brkyvz
AWESOME!

