
Distributed Filesystems for Deep Learning - jamesblonde
https://www.logicalclocks.com/why-you-need-a-distributed-filesystem-for-deep-learning/
======
manigandham
What does NVMe have to do with any of this? That's hardware storage
technology, and it's being compared to Python versions of deep learning
frameworks?

The whole point of having a file system is that programs are abstracted from
storage. What does native support even mean unless the file system somehow
hooks into the running code?

This makes no sense. It seems like a post trying to push HopsFS as more
performant, but comes across as inaccurate and misleading.

~~~
aogl
I agree, it appears as though NVMe was simply added in there to look
impressive and give another tick when in actual fact it's not relevant to this
layer of tech. This should be abstracted beyond hardware storage and be based
solely on the filesystem itself.

------
habitue
I am baffled why they have a matrix of what data science libraries they
support. They support reading and writing CSVs and Parquet (and, I assume,
arbitrary files as well). How is this HopsFS-specific support for Pandas?
Pandas is dumping CSV format to a file object that happens to write to
HopsFS.

This is a pretty weak sales pitch.

Edit: I'm not actually baffled. It's clearly SEO to catch people googling
around for "pandas support" or the like.

~~~
jamesblonde
Author of the post here. Part of the point of the matrix was to show that not
all distributed filesystems have native support for Python frameworks commonly
used for Data Science - Pandas/PySpark/TensorFlow. The other part is support
for NVMe hardware. Having been a professor for 17 years, I barely know what
SEO is.

~~~
dekhn
Can you explain what it means to have native support? Are you saying you hook
into Python's file io to implement remote IO? Or is the filesystem mounted via
FUSE or another technology? I don't think that Pandas has any HopsFS-specific
code in it.

~~~
jamesblonde
"In Pandas, the only change we need to make to our code, compared to a local
filesystem, is to replace open_file(..) with h.open_file(..), where h is a
file handle to HDFS/HopsFS."

HopsFS is a drop-in replacement for HDFS. Here are some more details on native
HDFS connectors in Python by Wes McKinney:
http://wesmckinney.com/blog/python-hdfs-interfaces/
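
As a concrete sketch of what that looks like with pyarrow's legacy HDFS
interface (which speaks the same wire protocol HopsFS implements); the host,
port, and path below are placeholders:

    # Read a CSV from HDFS/HopsFS into Pandas via pyarrow's legacy HDFS API.
    import pandas as pd
    import pyarrow as pa

    fs = pa.hdfs.connect(host="namenode", port=8020)  # placeholder endpoint
    with fs.open("/datasets/train.csv", "rb") as f:   # the h.open_file(..) step
        df = pd.read_csv(f)                           # Pandas just sees a file object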

~~~
dekhn
I think what you are saying is that you're wire-compatible with HDFS, and
Pandas has access libraries (special file objects) that support HDFS.

------
xfactor973
Gluster/Ceph support NVMe. That matrix isn't right

------
regularfry
Not by that graph, you don't. Pick a number in the middle: 2^24 x 10^6 words.
That's 130TB, give or take. Sounds like a lot, right? Not when you remember
that 16TB drives and 15-bay RAID enclosures exist...
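
Back of the envelope, assuming 8-byte words (an assumption on my part):

    # 2^24 x 10^6 words at an assumed 8 bytes per word
    words = 2**24 * 10**6       # ~1.7e13 words
    print(words * 8 / 1e12)     # ~134, i.e. roughly 130TB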

------
wskinner
Can anyone working on this project comment on how HopsFS compares to similar
alternatives? The post and the paper compare it to HDFS, but that's not really
fair - HDFS was not designed to handle huge quantities of small files, and if
you use it for that, you will of course not get good performance. Competitive
offerings would be things like Qumulo, Amazon EFS, Objective FS, perhaps
BeeGFS & Lustre, and so on.

~~~
jamesblonde
HopsFS is a derivative work (a fork) of HDFS - it is still 100% wire-
compatible with HDFS. It has distributed metadata (not metadata stored on the
heap of a single JVM). It also supports putting small files on NVMe storage:
https://www.logicalclocks.com/millions-and-millions-of-files-deep-learning-at-scale-with-hopsfs/
It's also open-source. So, it works in places where HDFS works. HopsFS is, to
the best of our knowledge, the only distributed filesystem with distributed
metadata (apart from Google Colossus - the hidden jewel in their crown).

Lustre and BeeGFS are HPC filesystems, I believe. No data locality. Amazon EFS -
feels like a MapR clone, but they expose some tunings - higher latency for
higher throughput and larger clusters. We don't make that trade-off. I don't
know enough about Qumulo and Objective FS to comment on them.

~~~
notacoward
> HopsFS is, to the best of our knowledge, the only distributed filesystem
> with distributed metadata

Then you haven't done your homework, since Gluster has had that architecture
for a decade. And it supports real POSIX semantics, too - not just the HDFS
subset.

~~~
gnufx
Indeed, and (as notacoward will know) the HPC filesystems OrangeFS/PVFS2,
Lustre, BeeGFS, and GPFS all have -- or can have -- distributed metadata, and
POSIX semantics (or POSIX modulo locking).

~~~
notacoward
Correct. I didn't mean to exclude those, but they weren't part of the original
comparison. I didn't mention Ceph because somebody else already had, plus I'm
not a maintainer for that as I am for Gluster. ;)

It might also be worth pointing out some distinctions regarding
_levels_ of distribution. The PVFS2 lineage has had _fully_ distributed
metadata for even longer than Gluster. OTOH, both Lustre and Ceph have
separate metadata servers, more than one but still fewer than there are data
servers. That doesn't necessarily limit scalability, but it does increase
operational complexity. I'm honestly not sure about BeeGFS or GPFS; I don't
pay much attention to them since one is only fake open source and the other
doesn't even try. If we want to include proprietary offerings we could also
add EFS, ObjectiveFS, OneFS, PanFS, etc.

BTW, for those here who know me, at $dayjob I've switched from working on
Gluster to working on a more HDFS-like data store. (They chose not to call it
a filesystem because it isn't one, which I consider a refreshing bit of
honesty in a field full of pretenders.) It scales way beyond any of the things
mentioned here, but it's proprietary so I can't compare notes as much as I was
able to with Gluster. :(

~~~
jamesblonde
We have different interpretations of what we mean by "distributed system". I
mean a group of servers that provide a single service. Or more precisely,
metadata is a single distributed shared-memory layer, where filesystem
semantics are the same regardless of which partition/shard of the metadata a
metadata operation is performed on.

What Ceph and many others do is support partitioned metadata. One volume ==
one metadata partition. But cross partition metadata operations do not have
the same semantics as intra-partition metadata operations. So, moving a
file/dir between volumes (partitions) is not atomic - they do not solve the
_hard_ consensus problem. We do. Gluster is also not solving the consensus
problem. The end result is leaky filesystem abstractions. If you move a file
between two directories in the same volume, it may go fast. But between two
directories in different volumes, it can be very slow. Renaming a file in Gluster can cause it
to move between hosts.
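
As a local-filesystem analogy for the atomicity point (only a sketch, not how
any of these distributed systems are actually implemented): a rename within one
filesystem is atomic, while a "move" that has to copy and then delete is not.

    import os, shutil

    def move_within_volume(src, dst):
        os.rename(src, dst)      # atomic on POSIX within one filesystem

    def move_across_volumes(src, dst):
        shutil.copy2(src, dst)   # step 1: copy (a crash here leaves two copies)
        os.remove(src)           # step 2: delete the original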

~~~
notacoward
> Renaming a file in Gluster can cause it to move between hosts.

No. It does not. Within a volume, files will _always_ be renamed in place on
the replicas where they currently reside. There are other problems with rename
in Gluster, having to do with how the locations of files are tracked when
they're not where they're "supposed" to be according to the hashing scheme,
but movement of data as part of rename is not one of those problems.

"Across volumes" is not an interesting or relevant case, because the whole
point of separate volumes in Gluster (and most other distributed filesystems)
is to provide _complete_ isolation. You seem to have this idea that users
would want to have lots of volumes and still move data around between them. Is
there something weird about _your_ architecture that forces them to do that?
Small volumes are operationally complex and strand resources. Generally, you'd
have just one volume. Access control, quota, etc. could still be applied at
the namespace or directory level, below volumes. It worked rather well at the
world's largest Gluster installation, which I helped run for a year and a
half, and I believe most other installations are similar.

Please try to learn about other systems before you make wild claims about
them, lest your audience perceive them as deliberate lies to benefit your
business. I don't think your claims about Ceph are anywhere near accurate
either, but I'll leave those for one of my Ceph friends to address.

~~~
jamesblonde
My claim is exactly about what constitutes a distributed namespace - I claim
it is a namespace that spans multiple hosts and requires (distributed)
consensus algorithms so that the nodes that provide the metadata service can
execute operations like rename purely in metadata. Volumes are
not just for isolation. They also partition the distributed namespace, and
make it easier to scale - but at a cost. Operations that cross volumes do not
solve the consensus problem, so they do not execute operations like move
atomically. Whether that is a copy-then-delete or something else is not the
issue. They still do not provide a single distributed shared namespace
abstraction with atomic operations.

~~~
notacoward
"i claim it is a namespace that spans multiple hosts and requires
(distributed) consensus algorithms..."

That's a very idiosyncratic definition of "distributed". To many in the field, "spans
multiple hosts" alone is sufficient. Some might refine it to preclude single
nodes with special roles or by adding requirements such as a single view of
the data etc., but it's a joke to add atomicity or implementation details such
as how a consensus algorithm is used. There are plenty of distributed data
stores that don't provide atomicity for all operations, and users are happy
with that. There are plenty that use leader election instead of consensus to
deal with consistency issues, and users are happy with that too. Inventing a
novel and highly specific definition to exclude them seems more than a bit
disingenuous. If we want to go that route, I'll point out that HopsFS is
misnamed because it's _not a filesystem_ according to commonly used
definitions. There are many hard problems that it wimps out on, that a real
filesystem must address.

"Operations that cross volumes do not solve the consensus problem, so they do
not execute operations like move atomically."

Again, you seem to be using "volume" in an idiosyncratic way. _Your architecture_
might misuse "volume" to mean some sort of internal convenience that's
practically invisible to users, but others don't. To most, a volume is a self-
contained universe, much like a virtual machine. Users are well aware that
moving data between either will not have the same semantics as moving data
within.

~~~
jamesblonde
My definition of distributed metadata for a FS is the same as you would have
for a distributed database. If I shard a database across many servers and do
not define semantics for transactions that cross shards, that is what you are
talking about. I am talking about a system that supports cross partition
transactions.

~~~
notacoward
Let's recap. You've gone from claiming that Gluster doesn't have distributed
metadata (untrue) to claiming that it copies data on rename (untrue) to
claiming that renames aren't atomic across partitions (untrue because such
partitions aren't even part of Gluster's architecture). Those are some very
mobile goalposts. You've made similarly untrue claims about Ceph, which does
have a sort of metadata partitioning but not the way you seem to think. Along
the way, you've tried to redefine terms like "volume" which have well
established domain specific meanings.

Such desperate attempts to disparage systems you obviously haven't even tried
to understand are not helpful to users, and are insulting to people who have
spent years solving hard problems to address those users' _actual_
requirements. Believe it or not, systems exist which don't work like yours,
and which prioritize different features or guarantees, and there's
demonstrated demand for those differentiators. If you want to learn about
those differences, to compare and contrast in ways that actually advance the
state of the art, please begin. I say "begin" because I see no evidence that
you've attempted such a thing. Does your academic employer know that you're
misinforming students?

~~~
jamesblonde
I disagree. I tried to define a distributed namespace for you, and you won't
accept my definition. And you don't accept that one of the reasons volumes
exist is to partition namespaces to make them easier to scale. Luckily for me,
peer-reviewed conferences and journals do accept my definition.

------
jey
What's the relevance of NVMe? Is its I/O latency still relevant in the
presence of networking and scheduling overheads?

~~~
jamesblonde
If your V100 GPU can process 150 images/sec, can S3 keep up? What if you now
have 10 GPUs or 100 GPUs? Now, HDFS can't keep up. But HopsFS can....
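
Rough numbers (the per-GPU rate is the one above; the ~100KB/image size is an
assumption):

    images_per_sec_per_gpu = 150
    image_size = 100 * 1024                      # ~100KB per image (assumption)
    for gpus in (1, 10, 100):
        mb_per_sec = gpus * images_per_sec_per_gpu * image_size / 1e6
        print(gpus, "GPUs ->", round(mb_per_sec), "MB/s")
    # 1 -> ~15 MB/s, 10 -> ~154 MB/s, 100 -> ~1.5 GB/s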

~~~
dekhn
Many people on these forums train directly from S3 and know how to get that
kind of throughput. Like any blobject store, the key is to use multithreaded
readers and high concurrency and set up your storage properly.
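
A minimal sketch of that pattern (the bucket name and keys are hypothetical; a
real pipeline would typically use tf.data or a similar input library on top):

    from concurrent.futures import ThreadPoolExecutor
    import boto3

    s3 = boto3.client("s3")

    def fetch(key):
        # each worker issues its own GET; many outstanding requests hide latency
        return s3.get_object(Bucket="my-training-data", Key=key)["Body"].read()

    keys = ["images/%06d.jpg" % i for i in range(10000)]
    with ThreadPoolExecutor(max_workers=64) as pool:
        for blob in pool.map(fetch, keys):
            pass  # hand each blob to the training input pipeline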

------
sgt101
I'm wondering if Kudu + Impala + RAPIDS + YARN + TonY might do instead... maybe
more support?

~~~
jamesblonde
What's RAPIDS?

How long will Impala last, post Hortonworks merger? With Hive at position 15,
and Impala at position 35, who do you think the merged entity will bet on?
[https://db-engines.com/en/ranking](https://db-engines.com/en/ranking)

~~~
sgt101
[https://news.ycombinator.com/item?id=18186392](https://news.ycombinator.com/item?id=18186392)
for RAPIDS

Impala was the wrong thing to posit in the list above - maybe Spark SQL is
better as you can run that under YARN as well.

------
huntaub
What about Elastic File System?

~~~
jamesblonde
It's the same as S3 - no support for NVMe.

~~~
dekhn
That's a non sequitur - EFS is a way to mount filesystems via NFS. What EFS
does internally is kind of irrelevant - Amazon could implement it with NVMe, or
whatever technology they build internally, and the client would see the
performance.

------
_zoltan_
We're using Spectrum Scale with its HDFS connector. While we're talking about
options, I thought I'd mention it as well.

~~~
gnufx
For what it's worth, you can do the same freely with at least OrangeFS,
Gluster, and Lustre if you must do things a Hadoop-ish way; I assume also with
Ceph.

------
xvilka
A problem with both HDFS and HopsFS is that they're written in Java, which
makes them slower and much more memory-hungry than native filesystems like
Ceph.

