
Pachyderm (YC W15) Challenges Hadoop with Containerized Data Lakes - pkbarber
http://thenewstack.io/pachyderm-aims-displace-hadoop-container-based-collaborative-data-analysis-platform/
======
daveguy
Very cool. I have been thinking about analysis reproducibility lately and this
is an interesting approach. I have two questions:

1) what is your business model? I see a great open source project and some
good information about why this is needed. However, I don't see any pricing
for products or consulting. How are you making money?

2) It is good that you have shed the overhead of Java and have the capability
to interface with multiple languages. However, one complication I see that you
have adopted is containerization. I understand the need and simplicity with
respect to saving snapshots and reproducing results. However, interacting at
the level of container manipulation is quite a bit of overhead for the average
user. Is there a plan to add a simplifying interface layer on top of the
container abstractions?

~~~
jdoliner
Good questions both.

We make money by offering support and services for Pachyderm. This includes a
couple of things: implementation services, in which we help people set up a
working Pachyderm cluster on their own infrastructure, help train their
engineers on how to use the system and establish a plan for getting the
company's data and analysis running on the new system; support, where we help
people with any issues that arise using the system; and consulting, where we
get a lot more involved and help people actually build things on top of
Pachyderm that their company needs.

If you're interested in these services feel free to drop me a line at
jdoliner@pachyderm.io

We totally agree that containers aren't a suitable interface for everyone. We
see containers as the ideal interface for developers. Over time we're
expecting developers to build simpler interfaces on top of it. We'll also be
building a lot of those ourselves. For example one feature we have on our near
term roadmap is a SQL layer. We're going to implement that in terms of our
container API which means that it should be very possible for a developer to
implement a totally new access pattern just by throwing some code in a
container.

------
hcatlin
I'm not really sure what the use cases are for collaboration on this kind of
thing or what that means. Are there real-world use cases I'm not thinking of?
"Collaboration" is a great buzzword, but what does it have to do with a data
store?

~~~
sjezewski
(full disclosure - pachyderm employee)

Good question. It's funny how much collaboration is overlooked. And you're
right - it's not obvious how a data store can enable collaboration.

In the software engineering world, collaboration by means of git is so
prevalent it's like breathing air. There's no such thing today for data
scientists! That's crazy, because doing data science involves more variables
than writing software alone.

Pachyderm stores your data in a git-like manner. We store the deltas and
version the data so that it's consistently reproducible. We also give you some
nice tools to run any code alongside your data.

This enables some very basic workflows:

1 - You're trying to develop your analysis - so work on your code & lock your
data

2 - You're trying to vet new data - develop and version your feature
extraction and data together

3 - You're trying to work on some analysis with colleagues - fork the data +
analysis to do your work ... then merge to make sure your work is compatible
before deploying

There are many more ... but hopefully that makes it a bit more concrete
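
To make "store the deltas and version the data" a bit more tangible, here is a minimal sketch of a git-like store in which each commit records only the paths it changed and points at its parent. It is a toy illustration, not Pachyderm's API or storage format: `Repo`, `Commit`, `fork` and `snapshot` are hypothetical names, and merging is left out entirely.

```python
# Toy illustration only: NOT Pachyderm's API or on-disk format.
# All class and method names here are hypothetical.
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Commit:
    id: int
    parent: Optional[int]              # None for the first commit
    delta: Dict[str, Optional[bytes]]  # only changed paths; None marks a deletion


class Repo:
    def __init__(self) -> None:
        self.commits: Dict[int, Commit] = {}
        self.branches: Dict[str, Optional[int]] = {"master": None}

    def commit(self, branch: str, delta: Dict[str, Optional[bytes]]) -> int:
        """Record a delta on top of the branch head and advance the head."""
        cid = len(self.commits) + 1
        self.commits[cid] = Commit(cid, self.branches[branch], dict(delta))
        self.branches[branch] = cid
        return cid

    def fork(self, src: str, dst: str) -> None:
        """Create a new branch that starts at the head of an existing one."""
        self.branches[dst] = self.branches[src]

    def snapshot(self, commit_id: Optional[int]) -> Dict[str, bytes]:
        """Rebuild the full data set at a commit by replaying its ancestors' deltas."""
        chain = []
        while commit_id is not None:
            c = self.commits[commit_id]
            chain.append(c)
            commit_id = c.parent
        files: Dict[str, bytes] = {}
        for c in reversed(chain):      # oldest first
            for path, data in c.delta.items():
                if data is None:
                    files.pop(path, None)
                else:
                    files[path] = data
        return files


repo = Repo()
c1 = repo.commit("master", {"sales.csv": b"a,b\n1,2\n"})
repo.fork("master", "experiment")      # workflow 3: fork the data
c2 = repo.commit("experiment", {"features.csv": b"x\n3\n"})
c3 = repo.commit("master", {"sales.csv": b"a,b\n1,2\n5,6\n"})
locked = repo.snapshot(c1)             # workflow 1: data "locked" at c1
```

Because a snapshot is addressed by commit id, later commits on either branch do not change what `repo.snapshot(c1)` returns, which is the property the "lock your data" workflow relies on.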

~~~
sjezewski
And I should add we talk about Collaboration and other design goals more here:
[https://pachyderm.io/dsbor.html](https://pachyderm.io/dsbor.html)

------
geodel
I am delighted by this. I tried Hadoop in the past and found it quite complex,
with badly designed APIs which changed drastically between ver 1.x and 2.x. I
see Hadoop as a manifestation of Java culture: big, complex, a memory hog.

Since Pachyderm is written in Go I hope it promotes a Go culture: Simple,
consistent and not a resource hog.

~~~
jdoliner
I'm happy you're delighted!

I agree very strongly with what you've written here too. I think it's
unavoidable that programs wind up taking on some of the cultural qualities of
the language that they're written in. It's a very interesting phenomenon that
I could probably talk about at length... but I digress. This was a very
conscious part of our decision to use Go. Its simplicity, consistency and
reasonable resource requirements have been hugely influential, but the biggest
one has been its pragmatism. Go is laser focussed on actually getting things
done, and the most prominent things written in it clearly have taken on this
ethos: etcd, Docker and Kubernetes, to name a few. All are relatively new tools
with alternatives written in other languages that have existed for a while.
But they've all found use because of how well they've embraced this ethos and
how appealing tools like that are to people. We use those tools internally and
have very much tried to make Pachyderm fit in with the ethos as well.

------
manojlds
Uses Kubernetes underneath, which is pretty interesting. The whole premise of
containers as MR primitives is also good. It will boil down to how the
filesystem, the pipeline service and the clustering are implemented and how
well they work together.

~~~
jdoliner
Kubernetes has worked really well for us. It's given us distributed systems
primitives that just work, which has allowed us to focus on the parts of the
system that are unique to what we're doing.

We get a lot of questions about why we chose to implement our own filesystem
and computation layer, and the answer we give is the one you have here: it
allows us to make sure they work well together because they're built from the
ground up with containerized workloads in mind. We could have tried to
containerize Hadoop, but I doubt it would have resulted in the fully
integrated system that we have today. Too much in the Hadoop ecosystem was
written for a pre-container world for it to make sense to containerize it
after the fact.

------
eterps
It uses Kubernetes underneath, which is nice for scaling but how do you run
automated functional, end-to-end or integration tests with this on your
development (or CI) machine?

~~~
sjezewski
Great question!

For local development - we recommend connecting to a dev k8s cluster, which is
easy enough to set up.

CI is actually something I get really excited about. Similarly for CI you can
have a separate k8s cluster for any tests you need to run. But because we
version everything, you can run 'unit tests' on sample data sets that you know
should be rock solid ... and also 'integration tests' on the
_same_versioned_data_ you would see in production.
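
As a rough illustration of a 'unit test' on a pinned sample data set, the sketch below identifies the sample by a content hash and refuses to run the analysis check if the data has drifted. This is a generic pattern, not how Pachyderm identifies versions; `data_version`, `mean_of_column`, `SAMPLE` and `PINNED_VERSION` are all made-up names for the example.

```python
# Generic sketch of testing against pinned data; not Pachyderm-specific.
# Every name below is hypothetical.
import hashlib


def data_version(data: bytes) -> str:
    """Identify a data set by the SHA-256 of its contents."""
    return hashlib.sha256(data).hexdigest()


def mean_of_column(data: bytes, column: str) -> float:
    """The 'analysis' under test: mean of one numeric column in a small CSV."""
    lines = data.decode().strip().splitlines()
    idx = lines[0].split(",").index(column)
    values = [float(line.split(",")[idx]) for line in lines[1:]]
    return sum(values) / len(values)


SAMPLE = b"user,amount\na,10\nb,20\nc,30\n"
# In a real CI setup this would be a constant recorded alongside the test,
# not recomputed from the sample at import time.
PINNED_VERSION = data_version(SAMPLE)


def test_mean_on_pinned_sample(data: bytes = SAMPLE) -> None:
    # Fail loudly if the sample is not the exact version the test was written for.
    assert data_version(data) == PINNED_VERSION, "sample data changed; re-pin it"
    assert mean_of_column(data, "amount") == 20.0
```

The same test body could be pointed at a production data version for an 'integration test'; only the pinned identifier would differ.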

~~~
jdoliner
Just to clarify, a dev k8s cluster doesn't require anything more than a stock
installation of Docker. K8s has actually gotten pretty easy to test against
recently thanks to being deployable as containers.

~~~
eterps
Do you have some examples or links to share?

~~~
jdoliner
This is the script we use to start k8s for tests:
[https://github.com/pachyderm/pachyderm/blob/master/etc/kube/...](https://github.com/pachyderm/pachyderm/blob/master/etc/kube/start-kube-docker.sh)

There are k8s docs on this somewhere, but unfortunately they move around a lot
so I can't find a link right now. It's just 1 docker run command though and
you're good to go.

------
taliesinb
Seems very similar to Joyent's Manta.

