This is something we probably should have addressed directly in the article, given how popular Spark is. The main reason we didn't is that we like Spark and don't feel it needs to be replaced. It doesn't seem to have most of the ecosystem problems we discuss, because it has a company behind it. My understanding is that Spark is designed to be used with different storage backends, and I'm very curious to see what would happen if we got it talking to pfs. I think it could work very well, because Spark's notion of immutable RDDs seems very similar to the way pfs handles snapshots.
Hope this clears a few things up, and apologies if I've mischaracterized Spark here; I've only used it a little bit.
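To make the RDD/snapshot analogy a bit more concrete, here is a toy sketch in plain Python. It uses neither Spark nor pfs; the `Dataset` class and its `map` and `lineage_depth` methods are made up purely for illustration. The only point is that a transformation never changes existing data: it produces a new dataset that remembers where it came from, much as a snapshot remembers its parent.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Dataset:
    """Hypothetical immutable dataset; not Spark's or pfs's actual API."""

    records: Tuple[int, ...]
    parent: Optional["Dataset"] = None  # lineage link, like a snapshot's parent

    def map(self, fn: Callable[[int], int]) -> "Dataset":
        # Never mutates self; returns a new dataset that points back at this one.
        return Dataset(tuple(fn(r) for r in self.records), parent=self)

    def lineage_depth(self) -> int:
        # Count how many transformations separate this dataset from its root.
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth


base = Dataset((1, 2, 3))
doubled = base.map(lambda r: r * 2)
```

Here `base` is left untouched after the `map` call, and `doubled.parent` still refers to it, so any earlier state can be reached again by walking the parent links.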
It operates very differently from Hadoop for us. Spark SQL allows third-party apps, e.g. analytics, to use JDBC/ODBC rather than going through HDFS. And the in-memory model and the ease of caching data from HDFS allow for different use cases. We do most of our work via SQL now.
Combining Spark with Storm, Elasticsearch, etc. also permits a true real-time ingestion and search architecture.