

Pfs v0.2 – A Git-like distributed file system for the Docker ecosystem - jdoliner
http://pachyderm-io.github.io/pfs/#whats-new-in-v02

======
seliopou
How does this compare to Irmin[0][1][2], a similar system written by the guys
working on Mirage? At first glance, the one major difference I see is that
Irmin supports merges, while Pfs does not. Also, what exactly is specific to
Docker here?

[0]: https://github.com/mirage/irmin

[1]: http://openmirage.org/blog/introducing-irmin

[2]: http://openmirage.org/blog/introducing-irmin-in-xenstore

~~~
jdoliner
Hi,

I wasn't previously familiar with Irmin, so I can't say this with certainty,
but it looks like the data models of pfs and Irmin are very similar. They're
targeting slightly different use cases, though, so I expect some performance
details to differ. Irmin bills itself as a database, so it likely has much
faster point access than pfs and more interesting indexing options. Pfs, on
the other hand, is designed to handle much larger file sizes than Irmin.
(Irmin doesn't list a limit, but people don't normally put TB-sized rows in
their databases.)

The answer to your second question points to another big difference between
pfs and Irmin. Although it's currently unimplemented, pfs is designed to
leverage Docker for its MapReduce implementation. Rather than implementing a
Java class to define a MapReduce job, as one does in Hadoop, in pfs you create
a Docker image with a webserver inside it to define your MapReduce job. This
makes it trivial to include whatever dependencies you want in your distributed
computations.
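To make the webserver-in-an-image idea concrete, here is a minimal sketch. This is a hypothetical illustration, not pfs's actual protocol: the POST-a-chunk contract, the port, and the line-oriented word-count output are all assumptions; it just shows that a map step can be nothing more than an HTTP handler packaged in whatever image you like.

```python
# Hypothetical sketch of a map step defined as a webserver (not pfs's real
# interface): a scheduler would POST a chunk of input to the container and
# read the mapped records back out of the response body.
from http.server import BaseHTTPRequestHandler, HTTPServer

def map_words(chunk: str) -> str:
    # Classic word-count map step: emit one "word\t1" record per word.
    return "\n".join(f"{word}\t1" for word in chunk.split())

class MapHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        chunk = self.rfile.read(length).decode()
        out = map_words(chunk).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)

# Inside the Docker image this would block, serving chunks on port 8080:
#     HTTPServer(("", 8080), MapHandler).serve_forever()
```

Because the job ships as an image, its dependencies (native libraries, runtimes, data files) travel with it, which is exactly the contrast being drawn with Hadoop's Java-class approach.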

Hope this clears some stuff up.

------
gargantian
I'm not seeing what this has to do with Docker, other than the planned future
ability to run M/R jobs inside Docker containers. That's an interesting idea,
but I don't think enforcing Docker has any advantages; if you have a clean and
simple API like REST or even plain pipes, you open yourself up to a whole new
world of composability without needing something relatively heavyweight like
Docker.

Additionally, you seem to be leaning on CoreOS at the moment. That seems a
dangerous dependency considering the CoreOS/Docker relationship.

~~~
jdoliner
> if you have a clean and simple API like REST or even plain pipes, you open
> yourself up to a whole new world of composability

Totally agree with this, and that's one of the core tenets of our API design.
We should probably be clearer about how Docker fits with pfs. Our APIs are all
designed as RESTful services to allow for composability; however, we want to
take a batteries-included-but-removable approach. In our case, Docker is a
battery. We want it to be there so that users have a really easy primitive for
implementing M/R jobs. But we recognize it might not be for every user, so we
also want to allow people to put anything they want there. I think the easiest
way would be to just let people pass an arbitrary endpoint to be used in an
M/R job.
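A rough sketch of what "pass an arbitrary endpoint" could look like from the scheduler's side, under stated assumptions (the function names and the stand-in uppercasing service are invented for illustration): the system wouldn't need to know whether the URL points into a Docker container or at anything else that speaks HTTP.

```python
# Hypothetical sketch: the M/R job is whatever answers at the endpoint URL.
# The scheduler just POSTs each chunk and collects the response.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

class EchoUpper(BaseHTTPRequestHandler):
    """Stand-in map service: uppercases whatever chunk it receives."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        out = body.upper()
        self.send_response(200)
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)

    def log_message(self, *args):  # keep the demo quiet
        pass

def run_map_chunk(endpoint: str, chunk: bytes) -> bytes:
    """What a scheduler might do per chunk: POST it to the user's endpoint."""
    req = Request(endpoint, data=chunk, method="POST")
    with urlopen(req) as resp:
        return resp.read()

# Demo: stand up the service locally and push one chunk through it.
server = HTTPServer(("127.0.0.1", 0), EchoUpper)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
result = run_map_chunk(f"http://127.0.0.1:{port}/map", b"hello pfs")
server.shutdown()
print(result)  # b'HELLO PFS'
```

The endpoint here happens to run in-process, but nothing in `run_map_chunk` would change if it lived in a Docker container, a VM, or a bare process behind a load balancer.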

> Additionally, you seem to be leaning on CoreOS at the moment. That seems a
> dangerous dependency considering the CoreOS/Docker relationship.

I'm hopeful that both of these companies' commitment to a
batteries-included-but-removable approach will make leveraging both ecosystems
a realistic option. I agree that it would be a pain to have to pick one.

~~~
gargantian
> In our case Docker is a battery.

Am I correct in assuming this means I can (eventually) use PFS without any
dependency on Docker? In particular, I'm interested in knowing whether I can
expect to be able to run PFS _contained_ within an arbitrary unprivileged
container and use my preferred orchestration around it. Or, is it the goal of
PFS to take over the orchestration plane, or require CoreOS's? Or, something
else?

> I think the easiest way would be just letting people pass an arbitrary
> endpoint to be used in an M/R job.

I love that idea.

~~~
jdoliner
We very much don't want to take over the orchestration plane. We'd much rather
interoperate nicely with existing orchestration systems. We just want to add
the ability to store and access large datasets within these existing systems.

Your ideal of being able to use an arbitrary unprivileged container and your
preferred orchestration software is how I feel it should work eventually as
well. Unfortunately, right now we have to target very specific environments so
we can focus our development efforts, so "eventually" may take a little while.

It's great to hear about these concerns early on so thanks for taking the time
to comment. I'll definitely make sure that we hold on to this as a core tenet.

------
jdoliner
Hi Guys,

Co-founder here. We'll be following this thread all day so feel free to ask us
any questions. Cheers!

