
Pachyderm Raises $10M to Bring Data Provenance to the Enterprise - jaz46
http://www.pachyderm.io/2018/11/15/Series-A.html
======
mmasters
I was part of an experimental neuroimaging group that tested Pachyderm OSS
years ago and at the time we were really impressed with the versioning
capabilities it provided. For us at the time it made it easy for each
researcher to grab and change data as needed for their own development without
requiring support from eng.

~~~
marmaduke
How well does that work when your datasets are a sizeable percentage of
available storage capacity, though? Is there some sort of deduplication at
work?

~~~
jaz46
Pachyderm does a ton of data deduplication, both for input data that's added
to Pachyderm repos as well as for output files.
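
As a rough illustration of the general idea (not Pachyderm's actual storage code), deduplication is often done by addressing data by a hash of its content. The `ContentStore` class and the file paths below are hypothetical names made up for this sketch:

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: identical bytes are kept only once.

    Hypothetical illustration only; not how Pachyderm is implemented.
    """

    def __init__(self):
        self.blobs = {}  # content digest -> bytes
        self.files = {}  # path -> content digest

    def put(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this exact content has not been seen before.
        self.blobs.setdefault(digest, data)
        self.files[path] = digest
        return digest

    def get(self, path):
        return self.blobs[self.files[path]]


store = ContentStore()
store.put("/scans/a.nii", b"voxels" * 1000)
store.put("/copies/a.nii", b"voxels" * 1000)  # same content, different path
store.put("/scans/b.nii", b"other")
print(len(store.files), "paths,", len(store.blobs), "stored blobs")
```

Two paths with identical bytes share one stored blob, so a researcher's copy of a large dataset costs roughly one extra index entry in this model.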

Pachyderm's pipelines are also smart enough to know what data has changed and
what hasn't, and only process the incremental data "diffs" as needed. If your
pipeline is just one giant reduce or training job that can't be broken up at
all, then this isn't valuable, but most workloads include lots of map steps
where processing only the diffs can be incredibly powerful.
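
To make the "only process the diffs" idea concrete, here is a minimal sketch of an incremental map step. It assumes the map function is pure (same input bytes, same output), and `run_incremental`, `cache`, and the file names are hypothetical, not Pachyderm APIs:

```python
import hashlib


def digest(data):
    return hashlib.sha256(data).hexdigest()


def run_incremental(inputs, map_fn, cache):
    """Apply map_fn to each file, reusing cached results for unchanged content.

    cache maps a content digest to the output computed for it earlier.
    Returns (outputs, processed), where processed lists the paths that
    actually had to be recomputed on this run.
    """
    outputs, processed = {}, []
    for path, data in sorted(inputs.items()):
        key = digest(data)
        if key not in cache:
            cache[key] = map_fn(data)  # new or changed content: do the work
            processed.append(path)
        outputs[path] = cache[key]
    return outputs, processed


def count_words(data):
    return len(data.split())


cache = {}
commit1 = {"a.txt": b"one two", "b.txt": b"three"}
out1, did1 = run_incremental(commit1, count_words, cache)

# Second "commit": a.txt is unchanged, b.txt changed, c.txt is new.
commit2 = {"a.txt": b"one two", "b.txt": b"three four five", "c.txt": b"six"}
out2, did2 = run_incremental(commit2, count_words, cache)
print(did1, did2)
```

On the second run only `b.txt` and `c.txt` are reprocessed. A single global reduce would not benefit, since any change to the input invalidates its one result.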

~~~
marmaduke
This is super cool, thanks for pointing that out. Is the hard part done by
Pachyderm or as some layer over container file systems?

~~~
ztjio
Pachyderm does it. That's like half of what Pachyderm does: manage the
versioned data and schedule workers to run your containerized processes
against it.

FYI, it's ridiculously easy to get going playing with Pachyderm if you just
want to check it out. You can run it on Minikube.

~~~
marmaduke
> You can run it on Minikube

Thanks for the tip. I just started down the k8s path from a bare metal cluster
and will try this.

------
marmaduke
I have a “data science pipeline” coordinated with a Makefile and run on CI/CD
(GitLab) with reports generated as build artifacts. Big stuff checked in with
Git LFS.

Why would I use Pachyderm?

~~~
jaz46
Pachyderm pipelines can be run in a massively distributed fashion with data
being sharded across many workers. Pachyderm also offers much better failure
semantics than Make + CI. For example, if one shard of your pipeline's data
fails because a node or container dies, Pachyderm will automatically make sure
that the data gets rescheduled to another worker.
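
A toy sketch of what "reschedule the shard to another worker" means, with no claim about how Pachyderm or Kubernetes implement it. The `run_shards` function, the worker names, and the simulated failure are all assumptions made for illustration:

```python
def run_shards(shards, workers, process):
    """Run each shard, moving it to the next worker if the current one fails.

    Returns (results, assignment), where assignment records which worker
    finally completed each shard.
    """
    results, assignment = {}, {}
    for i, shard in enumerate(shards):
        last_error = None
        for attempt in range(len(workers)):
            worker = workers[(i + attempt) % len(workers)]
            try:
                results[shard] = process(worker, shard)
                assignment[shard] = worker
                break
            except RuntimeError as err:
                last_error = err  # worker failed: try the next one
        else:
            raise RuntimeError(f"shard {shard!r} failed on every worker") from last_error
    return results, assignment


def process(worker, shard):
    # Simulate a dead node: anything sent to worker-1 fails.
    if worker == "worker-1":
        raise RuntimeError("node died")
    return shard.upper()


results, assignment = run_shards(
    ["s0", "s1", "s2"], ["worker-0", "worker-1", "worker-2"], process
)
print(assignment)
```

Only the shard that landed on the failed worker is retried elsewhere; the other shards' results are kept, which is the difference from rerunning a whole Make target.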

Each pipeline can have separate resource requirements (e.g. GPUs, lots of
memory, etc) and gets scheduled by Kubernetes.

Finally, Pachyderm is versioning all of the intermediate steps in your data
pipeline so if a downstream step fails, you don't have to restart from
scratch, you can pick up right where it left off.

~~~
agibsonccc
Congrats on your round! FWIW, (me being in a space adjacent to yours) I see
these kinds of "my 1 off pipeline seems good enough" comments all the time.

Most people don't need "industrial strength" until they hit a certain scale.
They tend to optimize for ease of use/simplicity. It's one thing if they don't
have to change their workflow; it's another if you have to not only get
people to change their workflow but also teach them something new.

Convenience matters more. There's a pain threshold of "new tool" vs "this
costs me x amount of time".

How are you guys overcoming this? Even though we're also in the infra space,
I've never seen you guys out in the wild. Where would I bump into you, and
at what scale would I want to add the complexity of k8s + pachyderm +
whatever other deps you guys have over just using an S3 bucket?

Another question: Why hasn't AWS just packaged this up and offered it as an
extension of their k8s service? How are you guys going to overcome that?

Usually I see "hybrid cloud" or "on prem" as the response. If it is on prem,
are you guys relying on the presence of k8s at customers? Do you guys use
something like gravitational?

~~~
marmaduke
> my 1 off pipeline seems good enough

Not what I was describing. We’ve been using the setup for a while with
multiple projects and for the data science end of a clinical trial.

I like building new setups incrementally out of tools I already know, so the
obvious question is why this new tool is a radical improvement.

------
alex_lfw
He was holding these two pieces of pizza...

~~~
jdoliner
Hi, Pachyderm CEO here. We have been discussing this comment internally for
the last 15 minutes and still don't understand it. But we really enjoy it.
Would you mind explaining a bit more what you are saying?

~~~
Liru
Not the person you are responding to, but the comment you're replying to is
referencing Seinfeld, due to the name of your company.

[https://en.wikipedia.org/wiki/The_Stand_In_(Seinfeld)](https://en.wikipedia.org/wiki/The_Stand_In_\(Seinfeld\))

~~~
jaz46
What a wonderfully obscure reference.

------
grbno
I've also wondered why I should use Pachyderm. Decided to give it a try, and
wrote the following blog post about it:
[https://medium.com/bigdatarepublic/pachyderm-for-data-
scient...](https://medium.com/bigdatarepublic/pachyderm-for-data-
scientists-d1d1dff3a2fa) " Finally, version control for your data "

------
mb4
Congrats guys! Data provenance is only becoming more important.

------
akircher
This is great news. An amazing team with a good solution to a huge problem. I
look forward to following your progress.

------
coolhand1
Love this project! Git for data is such a brilliant concept

------
otoolep
Congrats to the team at Pachyderm!

------
_drFaust
much deserved, been following this project for some time and have continued to
be impressed.

------
koolhead17
Congrats Pachyderm team.

------
atav1k
nice!

