
Large Scale Distributed Deep Learning on Hadoop Clusters - cjdulberger
http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop
======
duggan
Both this and Twitter Engineering's recent post[1] on HDFS make me wonder
whether HDFS is something a team would reach for in 2015.

I'm just starting to read up on the technologies in this area (i.e., I
haven't used much of the Hadoop stack yet), and I haven't found a
fundamental reason not to base batch processing on S3 (or the object store
of your choice). Existing software simply appears to assume that the
storage medium is a local hard drive.

Much of the operational challenge of HDFS appears to be scaling the
NameNode and provisioning capacity. S3 dispenses with both, and the only
cost appears to be throughput.

If software like Spark were modified to take a more native approach to S3,
could HDFS be dispensed with entirely?
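
For what it's worth, Spark can already read straight from S3 through the
s3a connector. A minimal sketch (assuming hadoop-aws and its AWS SDK
dependency are on the classpath; the bucket and paths are made up):

    // Word count reading from and writing to S3 directly via s3a://,
    // with no HDFS cluster involved. Credentials come from the
    // environment or the fs.s3a.access.key / fs.s3a.secret.key settings.
    import org.apache.spark.{SparkConf, SparkContext}

    object S3WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("s3-wordcount"))

        // s3a:// URIs go through Hadoop's FileSystem API, so Spark
        // treats the object store like any other filesystem.
        val lines = sc.textFile("s3a://my-bucket/logs/2015/09/*.gz")

        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("s3a://my-bucket/output/wordcounts")
        sc.stop()
      }
    }

The cost, as noted above, is that every read goes over the network.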

[1] https://blog.twitter.com/2015/hadoop-filesystem-at-twitter

~~~
bradhe
Traditionally, HDFS solved the data locality problem: move the computation
to the data instead of dragging the data across the network. Networks have
gotten MUCH faster, though.

~~~
jbooth
At every larger company, teams wind up shipping a 128MB jar assembly to
their 128MB data blocks, annihilating any locality gains: the code you move
to the data is as big as the data itself.

