

Pachyderm (YC W15) – A Data Processing Tool for the Docker Generation - mihwalski
http://techcrunch.com/2015/01/23/pachyderm/

======
buryat
> Companies that want to get serious about data analysis often have to hire
> elite programmers who specialize in writing Hadoop MapReduce jobs.

There's no need to do so; you can use Hadoop Streaming
([http://hadoop.apache.org/docs/current/hadoop-mapreduce-
clien...](http://hadoop.apache.org/docs/current/hadoop-mapreduce-
client/hadoop-mapreduce-client-core/HadoopStreaming.html)) and anything that
works with stdin/stdout.
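
To make the stdin/stdout point concrete, here is a minimal word-count sketch of what a streaming mapper and reducer could look like in Python. The function names and the "map"/"reduce" command-line switch are my own assumptions for illustration, not part of Hadoop; the only contract assumed is that each step reads lines on stdin, writes tab-separated key/value lines to stdout, and that the reducer sees its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one tab-separated (word, 1) record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each word.

    Assumes the input is already sorted by key, which is what the
    framework's sort between the map and reduce phases provides.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical switch: "map" runs the mapper, anything else the reducer.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

The same script can be checked locally without a cluster by piping text through the map step, `sort`, and then the reduce step.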

~~~
jdoliner
Hadoop streaming is definitely a great way to use Hadoop and actually provided
a lot of the early inspiration for Pachyderm. The thing that I always found
difficult when using it was dependency management. If I wanted to use
libraries it was a pain to get them distributed across all of the machines in
the cluster and invariably I'd get one machine with a different version of
something that would mess things up. This problem was the impetus for using
Docker to do MapReduce.

~~~
buryat
You can specify files & archives that have to be copied to nodes before
executing the job:

> The -files and -archives options allow you to make files and archives
> available to the tasks. The argument is a URI to the file or archive that
> you have already uploaded to HDFS.

[http://hadoop.apache.org/docs/current/hadoop-mapreduce-
clien...](http://hadoop.apache.org/docs/current/hadoop-mapreduce-
client/hadoop-mapreduce-client-
core/HadoopStreaming.html#Working_with_Large_Files_and_Archives)
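
As a rough sketch of how those options fit into a job submission, the helper below only assembles the argument list for a streaming job; it does not run anything. The jar name, HDFS paths, and script names are all hypothetical placeholders. The one thing taken from the streaming docs is that `-files` and `-archives` are generic options, so they go before the streaming options such as `-input` and `-mapper`.

```python
import shlex


def streaming_command(streaming_jar, input_path, output_path,
                      mapper, reducer, files=(), archives=()):
    """Assemble an argv list for a Hadoop Streaming job.

    Generic options (-files, -archives) are placed before the
    streaming options; multiple URIs are comma-separated.
    """
    cmd = ["hadoop", "jar", streaming_jar]
    if files:
        cmd += ["-files", ",".join(files)]
    if archives:
        cmd += ["-archives", ",".join(archives)]
    cmd += ["-input", input_path, "-output", output_path,
            "-mapper", mapper, "-reducer", reducer]
    return cmd


cmd = streaming_command(
    "hadoop-streaming.jar",            # assumed jar location
    "/data/in", "/data/out",           # hypothetical HDFS paths
    "python3 mapper.py", "python3 reducer.py",
    files=["hdfs:///user/me/mapper.py",    # hypothetical URIs
           "hdfs:///user/me/reducer.py"],
    archives=["hdfs:///user/me/deps.zip#deps"],
)
print(shlex.join(cmd))
```

This ships the scripts and a dependency archive with the job, but it still leaves you responsible for the interpreter and system libraries already installed on each node.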

------
mglukhovsky
Joe gave a great talk on Pachyderm at a recent RethinkDB meetup that's
definitely worth watching:
[https://www.youtube.com/watch?v=kRjk_Xsf7t4](https://www.youtube.com/watch?v=kRjk_Xsf7t4)

------
Yadi
This is pretty awesome!

I was sold at this point:

'“if you can fit it in a Docker container, Pachyderm will distribute it over
petabytes of data for you.” One cool example that uses Pachyderm is this
MapReduce job for analyzing and learning from blunders in chess games.'

~~~
jdoliner
Hi!

We actually did that demo as part of a post. We just released it right here:
[https://medium.com/@jdoliner/when-grandmasters-
blunder-a8198...](https://medium.com/@jdoliner/when-grandmasters-
blunder-a819860b883d). Hope you enjoy it.

~~~
Yadi
Pretty cool!

Thanks!

------
thinkingkong
So it's Manta with Docker instead? See
[https://github.com/joyent/manta](https://github.com/joyent/manta)

~~~
jdoliner
I haven't actually used Manta but I think in terms of computation model this
comparison is fair. According to the docs: "Users express computation in terms
of shell scripts, which can make use of any programs installed in the default
compute environment." The problem I see with this is that it requires you to
manage a global environment which can get difficult for large teams.

------
jdoliner
Hi guys, founder here. We'll be hanging out in this thread all day to chat
with you guys about what we're doing. Ask us questions!

~~~
hobofan
Hi there, Pachyderm looks like it really fits into the architecture we are
going for.

I only took a brief look, but it looks like it is only loosely tied to CoreOS?
So hacking it to run on Kubernetes/Apache Mesos etc. should be relatively
easy?

~~~
jdoliner
Hi! Yes, pfs is designed with the "batteries included but removable" ethos in
mind and is only loosely tied to CoreOS. Getting it to run on Kubernetes and
Mesos is something we're definitely going to do at some point in the near
future. I've set up an issue here:
[https://github.com/pachyderm/pfs/issues/29](https://github.com/pachyderm/pfs/issues/29)
to track this effort. Feel free to stop by and discuss!

Also, if you'd prefer email, drop me a line at jdoliner@pachyderm.io.

~~~
curun1r
I had the same reaction to seeing CoreOS in the intro as the other
poster...I'd love to play with this, but there's no way that I'd get approval
from our security folks to use this in any serious capacity.

If you're looking for other tools to support, consider Consul, which should
have an almost identical interface to etcd. Also, I'm not sure if ECS and/or
Docker Swarm would satisfy the rest of the CoreOS responsibilities, but those
are the two that it looks likely that we'll be able to use.

~~~
jdoliner
Hi!!

Just to clarify, is CoreOS in general a no go for you guys and is that due to
its newness or something else? I'm unfortunately pretty ignorant on what the
approval process looks like for security folks.

The good news is that pfs is designed to be loosely coupled and we're
definitely going to look into support for a variety of deployment methods.
Consul in particular looks like it would be very straightforward to add;
I've created an issue for it here so you can follow it:
[https://github.com/pachyderm/pfs/issues/31](https://github.com/pachyderm/pfs/issues/31).

I'm hopeful about ECS as well; it smells like it might be the simplest thing
that could work for pfs. I'd love it if we could not even deal with a notion
of machines in pfs deployments.

~~~
hobofan
I can't speak for curun1r, but the requirement for a whole OS is pretty
restricting, especially when you already have a whole environment with
configuration management etc. based on another OS in which you invested a lot
of time.

From personal experience (granted this is from the early days of CoreOS),
CoreOS is only really viable if you are going for a hardcore "dockerize
everything" approach, which has too many unpolished points as of now. For
now I'd rather use a rigid base (whatever OS, chef, mesos, marathon) with the
more fluid parts (services inside docker, pachyderm, etc.) on top.

------
tetron
This is very interesting! I will have to look into this more. Two quick comments:

How does this compare to Arvados ([http://arvados.org](http://arvados.org))?

You might be interested in the common workflow language effort which among
other goals is developing a multi-vendor standard for wrapping analysis tools
using Docker ([https://github.com/common-workflow-language/common-
workflow-...](https://github.com/common-workflow-language/common-workflow-
language))

~~~
jdoliner
Hi,

Just read up on Arvados. I'm totally new to it but it seems really cool. I
think pfs and Arvados are very philosophically aligned. Arvados seems to
believe as we do that the most important thing missing from data science right
now is collaboration. I didn't have the time to fully grok their architecture
but it seems like there are some differences there. In particular I don't know
if Arvados has git-like semantics. It does on the other hand have a ton of
cool features that pfs doesn't have yet such as content addressable storage
and support for in memory databases that help speed up computations (I think
that's what they do at least.) Definitely a very interesting project and
hopefully good ideas can flow between the ecosystems.

The common workflow language effort is definitely very interesting as well. It
looks like it's bioinformatics-specific right now, but we're planning to have
something similar but more generic in the near future and I'll definitely be
looking to CWL for inspiration.

Thanks so much for the links; they're both very interesting.

------
ende
If you're looking to build complex analysis pipelines using docker, check out
NextFlow ([http://www.nextflow.io](http://www.nextflow.io))

