

An in-memory distributed Posix-compatible file system - shafiee01
http://bshafiee.github.io/BFS/

======
dsr_
The standalone mode is boring -- it's a ramdisk, about the same as you get
from tmpfs.

The standalone_swift mode is interesting -- it's a ramdisk with eventual
commit to disk -- but I don't see performance figures or essential config
details, like how often it flushes.

The distributed mode is surely cool for some use cases, but it doesn't seem to
commit to disk and it neither replicates nor works in parallel.

~~~
shafiee01
1 - Well, if you don't want to use backend storage, what else do you expect?

2 - It will flush as soon as you call flush or close your file. It's not
necessarily disk; it can be any configured backend (right now only Swift is
supported; however, it's easy to develop other backends as well).

3 - Replication is done in the backend storage. Once your data is flushed, it's
replicated to your desired degree in Swift, Amazon, or whatever backend you are
using. What do you mean by 'nor works in parallel'?

~~~
notacoward
If the back end is eventually consistent, how can you claim POSIX compliance
on the front end? User writes and fsyncs, BFS node goes down, somebody tries
to read. What will they get? Who the heck knows? The only in-memory copy is
gone, and the back end you've chosen doesn't guarantee that you'll get the
most recently written data on the next read.

It's perfectly OK if you decide on a different consistency model, but then you
can't say you're POSIX compliant (actually you can't anyway for legal reasons
but that's another topic).

~~~
shafiee01
If you flush a file it will be in the backend; therefore, even if a node goes
down, another node will become responsible for that file. There will obviously
be a delay for the recovery. In POSIX, if your buffer is not flushed, there is
no guarantee that you will get the latest version either.

Yeah, maybe I should totally get rid of that "POSIX compatible" thing. I had
that in mind because I expose my fs interface through the FUSE library, and
supposedly that covers POSIX.

~~~
notacoward
> In POSIX if your buffer is not flushed there is no guarantee that you will
> get the latest version neither.

That's not true. If it's not flushed there's no guarantee that it will be
_durable_ (i.e. on stable storage, will survive a crash of the entire system).
However, there is a guarantee that it will be _consistent_.

[http://pubs.opengroup.org/onlinepubs/009695399/functions/wri...](http://pubs.opengroup.org/onlinepubs/009695399/functions/write.html)

> After a write() to a regular file has successfully returned:

> Any successful read() from each byte position in the file that was modified
> by that write shall return the data specified by the write()

Even more specifically, and relevantly to the current point...

> If a read() of file data can be proven (by any means) to occur after a
> write() of the data, it must reflect that write(), even if the calls are
> made by different processes.

Many POSIX requirements related to what the database folks would call ACID are
hard to meet in a distributed file system, but they are nonetheless
requirements. This particular requirement implies that if a node fails but the
system as a whole remains up, then the last write must be _immediately_
available. It could be met, for example, by doing your own synchronous in-
memory replication. However, if you rely on pinging the data through a backing
store that only provides eventual consistency, then the requirement is not
met.
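To make the consistency-vs-durability distinction concrete, here is a minimal
Python sketch of the guarantee quoted above, using the POSIX calls exposed by
the `os` module (the file path and contents are of course illustrative):

```python
import os
import tempfile

# After write() successfully returns, a read() of the same bytes must see
# them -- even through a different file descriptor, and even though nothing
# has been made durable yet.
fd, path = tempfile.mkstemp()
reader = os.open(path, os.O_RDONLY)

os.write(fd, b"hello")                     # successful write() returns
assert os.pread(reader, 5, 0) == b"hello"  # consistency: the read sees it

# Durability is a separate property; only fsync() speaks to it.
os.fsync(fd)

os.close(fd)
os.close(reader)
os.unlink(path)
```

An eventually consistent back end can satisfy the fsync/durability side of this
while still violating the read-after-write side once the in-memory copy is lost.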

------
notacoward
The obvious comparison for standalone mode would be to a regular file system
built on a ramdisk. The obvious comparison for network mode would have been
Plain Old NFS with a ramdisk on the server. Both comparisons are missing.
Also, this statement is flat-out untrue.

> IOZONE performs all of its operations on one file

Getting iozone to use multiple files and threads is trivial (-t option), and
the results for that would be much more interesting than those for a single
file/thread. Unfortunately, such benchmark errors are _extremely_ common. I
don't fault the author so much as his advisor, who seems to have a background
in computer networking (hence the strong focus in this work on alternative
network interfaces) but none in storage. My advice to the author would be to
find a more appropriate advisor for this project.
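For reference, a multi-thread/multi-file iozone run is a one-liner (flags per
iozone's documentation; the sizes here are only examples):

```shell
# Throughput mode: 8 threads, one file per thread, 1 GiB files,
# 128 KiB records, tests 0 (write/rewrite) and 1 (read/reread).
iozone -t 8 -s 1g -r 128k -i 0 -i 1
```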

~~~
shafiee01
1 - This is just a preliminary evaluation. I just made the repository public 3
hours ago. I am planning many experiments, including the ones you mentioned.

2 - As I said, more benchmarks are coming, including multiple-file iozone runs
and Postmark. This is just to give viewers a sense. P.S. I am happy with my
supervisor :) thanks for your recommendations though ;)

~~~
notacoward
As you continue your testing, I strongly recommend that you use different
numbers of threads/files and queue depths, as well as different file and I/O
sizes. If you want to make scalability claims, you'll also need to test with
multiple clients and servers. It's also good practice to show the _exact_
iozone commands used, or (even better) fio command line and profiles.
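For instance, a hedged fio sketch along those lines (every parameter here is
illustrative; adjust the directory, sizes, and queue depths to the system under
test):

```shell
# Random mixed I/O: 4 jobs, queue depth 16, 4 KiB blocks, 1 GiB per job.
# --directory would point at the mounted BFS volume.
fio --name=bfs-randrw --directory=/mnt/bfs \
    --rw=randrw --bs=4k --size=1g \
    --numjobs=4 --iodepth=16 --ioengine=libaio \
    --direct=1 --group_reporting
```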

More generally, I think it would be very helpful if you could be more explicit
about how your system works. For example...

* What are your consistency/durability guarantees? I mean _really_, since they're clearly not POSIX.

* How do you detect and respond to faults?

* How would you describe the system in CAP terms?

* How do you reconstruct file system state from the backing-store object state after a node failure or cold restart?

* How do you decide which node should hold a file? How do you re-make that decision when servers are added or removed? Can they be?

These are the decisions that every distributed file system must make (and an
explicit comparison to others sure wouldn't hurt). While it might seem like a
lot, you did say this is for a master's thesis and it's stuff your advisor
should already have required. It will certainly come up when you go to defend
that thesis. If this truly advances the state of the art in some way, you
should already know exactly how and be able to explain it.

Disclaimer: I work on GlusterFS. On my own, I've worked on two toy projects
(CassFS and VoldFS) that explored ideas in how to build a distributed file
system on top of another kind of data store. I might be able to help you, if I
knew more about which issues you'd already addressed and which remain.

~~~
shafiee01
Yes, I plan to run experiments with different numbers of threads/files and
many other factors. Right know I only have access to a cluster of 7 old
machines. That's the best I can do and for the hadoop case I used all of the
machines. Sure thing I will provide all of the commands that I used. I just
made this project public to get this type of comments. I have not written or
explained anything yet. So there are A LOT to explain like the questions you
raised. I will try to answer them here as well: Here is a big sketch of the
system: Nodes use fuse to provide a traditional file system view to
applications. Zookeeper is used for consensus and it keeps track of what file
(name) each node is holding. Swift (or any other persistent storage) is used
as the backend. Nodes communicate through TCP or Zero_networking to each other
for remote IO operations. * When a file is flushed my consistency/durability
guarantees are whatever the backend storage is. If swift, then it's swift's
consistency/durability guarantees. While in memory (not flushed or closed)
it's consistent because there is only one copy of file at the host node but
not durable because if that node dies it's gone. _if a node fails there is a
master (elected with zookeeper) which will assign the files that node was
holding to other nodes (assuming that nodes has flushed it 's files to the
backend). Failure in zookeeper and swift are out of my project scope. _ As
soon as the system starts, it goes through a zookeeper master election and
then master distributes files in the backends to the live nodes. * right now
it's just a silly algorithm (the most free node) however, it can/should be
changed to a more advanced mechanism which probably considers load and many
other factors as well.
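The "most free node" placement described above, together with the master's
reassignment after a failure, could be sketched roughly like this (node names
and the free-memory map are illustrative; this is not BFS's actual code):

```python
def pick_host(free_mem):
    """Pick the node with the most free memory. free_mem: name -> free bytes."""
    return max(free_mem, key=free_mem.get)

def reassign(orphaned_files, free_mem):
    """Greedily reassign a dead node's (already flushed) files to live nodes."""
    placement = {}
    for f in orphaned_files:
        host = pick_host(free_mem)
        placement[f] = host
        free_mem[host] -= 1  # crude load accounting so one node isn't flooded
    return placement

nodes = {"node1": 4_000, "node2": 9_000, "node3": 1_000}
print(reassign(["a.txt", "b.txt"], nodes))  # both land on node2 here
```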

I am very happy to answer your questions and I really enjoyed answering them.
I am sure this is the type of question I will get in my defence session as
well, and that's why I posted this here. :)

Awesome, it would be great if other people could contribute to my project.
There are many things left to do and tons of space for improvement. Maybe we
can discuss more through email.

------
virtuallynathan
Are you planning to benchmark this on 10G/40G networks? At first glance it
looks like it is limited by the available network bandwidth.

~~~
shafiee01
I wish I could :) but right now in our lab I don't have access to such
equipment. Honestly, that was the plan, because using 10/40G networks would
significantly reduce the delay of remote operations as well as increase their
throughput.

~~~
notacoward
Might I suggest a round of testing on Amazon or Rackspace? They both have
10Gb-equipped instance types with SSDs and everything, for a few bucks per
cluster-hour. That shouldn't break even an academic budget. You might not be
able to do your special ZERO networking in that environment, but you'd be able
to explore other interesting corners of the performance space and your
experiments would be reproducible by others.

~~~
shafiee01
Yes, I will talk about that with my supervisor.

