

Pachyderm (YC W15) Raises $2M to Power Big Data Processing for the Docker Era - jaz46
http://techcrunch.com/2015/06/10/pachyderm-raises-2-million-seed-round/

======
jdoliner
Hi guys, co-founder here. We'll be hanging out on this thread all day to
answer any questions.

~~~
TallGuyShort
I'm curious about the decision to target Docker so specifically. Is that just
how you plan to manage resources between different users / workloads? Or is
there another benefit?

I work in the Hadoop space. I like the stack as a whole, but I do see the
downside of it having evolved over time as a bunch of independent projects,
so although I suspect Pachyderm will have a long way to go, I watch with
interest. We mainly use Docker containers for testing /
training, etc. A distributed system allows you to scale, but you still want
each node to be as big as possible, so in production I'd never use Docker.
YARN of course puts things into "containers" to manage resources, but each
container is still running in the same OS. Targeting a system that partitions
the resources of a physical machine so completely seems counter-intuitive for
Big Data, so what's the thinking there?

edit: or was the focus on Docker added by TC to attract more clicks? :)

~~~
jdoliner
Containers solve a few problems for us. They allow us to provide strong
isolation guarantees for jobs and offer nice resource quota semantics (via
cgroups). They also serve as a good format for specifying jobs because they
can be easily passed around between nodes via docker pull and push.

Our use of Docker is very similar to how you describe YARN's use of
containers. We deploy on a cluster of CoreOS machines and each machine is
generally as big as possible. We then pack several containers onto that same
machine so that we can get as much performance as possible out of it. This
also gives us an opportunity to do smart things like sharing data volumes
between Docker containers that are running different workloads over the same
data.
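
As a rough illustration of the two points above (cgroup-backed quotas and a
data volume shared between containers on one machine), here is a minimal
sketch that only builds `docker run` command lines. It is not Pachyderm's
actual code: the image names, the host path, and the helper `docker_run_args`
are hypothetical, and the only Docker features assumed are the `--memory`,
`--cpu-shares`, and `-v host:container:ro` options.

```python
from typing import List


def docker_run_args(image: str, memory: str, cpu_shares: int,
                    host_data_dir: str, mount_point: str = "/data") -> List[str]:
    """Build a `docker run` argument list (hypothetical helper).

    The limits are enforced by the kernel through cgroups; the bind mount
    exposes the same host directory, read-only, to every job that uses it.
    """
    return [
        "docker", "run", "--rm",
        "--memory", memory,                # memory cap for this container
        "--cpu-shares", str(cpu_shares),   # relative CPU weight
        "-v", f"{host_data_dir}:{mount_point}:ro",  # shared, read-only data
        image,
    ]


# Two hypothetical jobs packed onto one large machine, reading the same data.
jobs = [
    docker_run_args("example/wordcount", "4g", 512, "/var/shared/dataset"),
    docker_run_args("example/histogram", "8g", 1024, "/var/shared/dataset"),
]

for args in jobs:
    print(" ".join(args))
```

Each printed line could be run by hand on a host with Docker installed; the
sketch deliberately does not execute anything itself.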

------
znt
Could you please explain the difference between Datomic
([http://www.datomic.com/](http://www.datomic.com/)) and "The Pachyderm File
System"?

~~~
jdoliner
Hi! Good question.

Datomic and pfs have very similar data models; we drew some inspiration for
our architecture from them.

They're targeting different use cases, though. Datomic is a database, so it's
meant as the source of truth that you run your application off of. It has
features like optimized point queries, indexing, caching, etc. Pachyderm, on
the other hand, is meant for analytic workloads, so we focus on having very
fast aggregations and tools for long-running queries and workloads.

Hope this clears it up for you :)

~~~
znt
Thanks for the answer. Where can we submit our suggestions for
examples/tutorials?

I mean having an example project that runs some Python/R/C# code against
historical stock market data would be a good use case.

~~~
jdoliner
This example seems really interesting, we should do it.

GitHub issues would be a fine place I think.

Or emailing me at jd@pachyderm.io if you'd prefer :)

------
redwood
I'm sorry and don't want to be negative. Only helpful. Change your name asap.

You don't want "derm" on there. This is not a skin medicine.

Pachy is a little odd too, but less so.

Pachyd ?

