Are there statistical differences in slumping rate between pure-software startups vs domain-specific startups that use software as a tool? Numbers, please. Thanks.
Hi Jeff, congrats on your work. One question of general interest for budding big data scientists and engineers: from your position, do you think Spark is going to replace Hadoop in the near future, or will they occupy different niches in the market? Thanks.
I think that the Hadoop ecosystem has just expanded. Spark is now a key part of that ecosystem.
Hadoop's MapReduce implementation is arguably becoming obsolete quickly, with many more powerful platforms (not just Spark) beating it on several dimensions.
That said, if your interest is machine learning specifically, I can't sell Spark hard enough. Functional programming is so critical for modern, large-scale machine learning. Spark is an absolute revelation for the machine learning developer. It's not the only stack; the PyData stack is totally worthy of study and use. But for big data machine learning, Spark is as perfect as anything has ever been.
If you're really interested in learning more about how Spark, Scala, and functional programming come together in a machine learning system, I'm writing a book on reactive machine learning: http://www.reactivemachinelearning.com/ In it, I'm trying to cover the how and why of different tools, with a focus on Scala, Spark, and Akka.
I use Spark in conjunction with many Hadoop ecosystem mainstays: YARN, HDFS, etc. Hadoop MapReduce can be swapped out for Spark, but many great things beyond that have stemmed from the Hadoop project.
Yep, our Spark deployment, like many others, uses YARN and HDFS. EMR has done some great work to make YARN a solid deployment target for jobs using various technologies.
I'm very much not against the Hadoop ecosystem. The ecosystem represents very real progress for data infrastructure. But Hadoop MapReduce is just not what people should be using to build machine learning jobs at scale in 2015.
Spark makes great use of the Hadoop ecosystem, and I'm primarily interested in future innovations in the big data space that try to work with the Hadoop ecosystem instead of trying to supplant it. Modularity and composability benefit us all.