
Pachyderm Hub: data science without the hassle of managing infrastructure - jdoliner
https://www.pachyderm.com/blog/pachyderm-hub-is-now-in-production/
======
data_ders
I'm quite excited for the emerging Python-, Git-, and DAG-driven data
engineering workflows/orchestrators that are code-first, GUI-second. But is
anyone daunted at the prospect of investing in one of these platforms, then
having to pivot to the paradigm that emerges as the winner 3-5 years from now?
I've been trying to keep up with all the tools, but am pretty overwhelmed at
this point. Obviously there are differences, but off the top of my head I can
think of: Airflow, Luigi, Pachyderm, Dagster, Google Cloud Composer, Kubeflow,
MLflow, Azure ML Pipelines, Ascend.io, and so, so many more.

~~~
slewis
Shameless plug: my company Weights & Biases
([https://wandb.com](https://wandb.com)) has tackled this in a different way.
We make tools to keep track of results across your pipelines regardless of how
you choose to orchestrate execution. This is one of our big selling points:
very simple on-ramp to reproducible tracking, and infrastructure agnosticism.
These have led to broad adoption across companies and academics. We also have
a pretty cool UI.

Pachyderm makes different tradeoffs and we're excited to see their launch.
Seriously congrats to you all! This is an invigorating space to say the least
;).

~~~
jdoliner
We have at least one customer who's using both wandb and Pachyderm together. I
actually don't think they overlap that much, although I may be
misunderstanding what wandb does. Pachyderm doesn't do anything to track
models explicitly. Our tracking is specifically about data lineage, i.e. what
data was fed into this container to create this result. It looks like wandb
does do dataset versioning, but it's unclear to me whether it's storing
references to versions of the dataset or whether it's actually the source of
truth storing the data. I think it's the former, but I'm not sure. Pachyderm
focuses on the latter: being the system that stores and versions the large
datasets and presents a unified hash for them. Other systems can then record
that hash to have immutable versioned datasets they can rely on.
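The "unified hash" idea can be sketched in plain Python: a deterministic content hash over a directory of files gives any external tracker a stable identifier to record. (This is an illustration of content-addressing, not Pachyderm's actual implementation; `dataset_hash` is a made-up helper.)

```python
import hashlib
from pathlib import Path

def dataset_hash(root: str) -> str:
    """Content-address a directory: hash every file's relative path and
    bytes in sorted order, so identical data always yields the same id."""
    h = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            h.update(path.relative_to(root).as_posix().encode())
            h.update(path.read_bytes())
    return h.hexdigest()
```

Because the hash is a pure function of the contents, any system that records it (an experiment tracker, a model registry) can later verify it is looking at exactly the data the pipeline saw.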

We do have a few philosophical differences though. The biggest is that we're
opposed to requiring data scientists to include tracking code to trace
their results. Every company we've seen use systems like that winds up with
different levels of instrumentation on different pipelines and a lot of
experiments with no instrumentation. It makes it hard to answer questions like
"what are all the places this data is being used?" conclusively, because
there's always this "dark matter" of code that hasn't been instrumented that
your code doesn't see. We prefer a system that automatically tracks things
without asking the user to do anything so that information is always collected
and there when you need it. That being said, I'm not sure how you could do the
type of fine-grained model tracking you guys do automatically. We can track
lineage automatically because we're the system that stores and exposes the
data, so we know when it's being accessed.
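The "track automatically at the storage layer" argument can be sketched with a toy store whose lineage log is a side effect of every read, rather than an opt-in call the user might forget (all names here are illustrative, not Pachyderm's API):

```python
class TrackedStore:
    """Toy storage layer that records lineage as a side effect of access:
    every read is logged with the reader's identity, so nothing depends
    on users remembering to instrument their pipelines."""

    def __init__(self):
        self._data = {}
        self.lineage = []  # (reader, key) pairs, appended automatically

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str, reader: str) -> bytes:
        self.lineage.append((reader, key))  # recorded on every access
        return self._data[key]
```

Because the log lives in the store, the question "what are all the places this data is being used?" is answered by a lookup, with no "dark matter" of uninstrumented readers.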

~~~
slewis
FWIW, our Artifacts tool can track references to data that exists in other
systems, or store the data directly, which lets us ensure there are no
"dark matter" users like you mentioned. It's a flexible system.

~~~
jdoliner
The key difference is that it's something you _can_ do, rather than something
that just happens automatically. It's not about what you're able to do, it's
about what you can forget to do. Storing references to external systems is
something that people do in Pachyderm as well, but we discourage it because
those external systems generally aren't immutable. So they can be deleted, or
even worse change, at which point your lineage tracking is actually misleading
you rather than informing you.
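The mutable-reference hazard described above can be made concrete: store a hash alongside the reference at registration time and re-check it on every read, so a mutated external object fails loudly instead of silently corrupting your lineage. (A minimal sketch; `ExternalRef` and `fingerprint` are invented names, not any product's API.)

```python
import hashlib

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class ExternalRef:
    """A reference to data held in a non-immutable external system,
    plus a hash taken when the reference was recorded. If the external
    data later changes, resolve() raises instead of returning bytes
    that no longer match the recorded lineage."""

    def __init__(self, uri: str, fetch, expected_hash: str):
        self.uri = uri
        self._fetch = fetch  # callable returning the current bytes at uri
        self.expected_hash = expected_hash

    def resolve(self) -> bytes:
        data = self._fetch(self.uri)
        if fingerprint(data) != self.expected_hash:
            raise ValueError(f"{self.uri}: data changed since it was recorded")
        return data
```

The check turns "misleading" into "detectable": a deleted or rewritten object can no longer masquerade as the version the pipeline actually consumed.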

------
hobofan
After following Pachyderm for a long time, the docs and everything finally
look to be at a point where I really want to try it out!

There's no point, though, in hiding the pricing behind the signup, especially
if it's the first thing you see after signing up. All you are doing is
inflating your signup numbers (and maybe even deterring a few people).

PS: The link under "Resources -> What's the typical developer workflow?" is
broken.

~~~
jdoliner
Thanks for staying involved with the project, we're so glad to hear that the
improvements are working.

You bring up some good points about the pricing page. We'll be making the
pricing visible without login soon. If you have some time we'd love to get
some more feedback via our Slack channel (pachyderm.com/slack). Since you've
been involved for so long, your feedback would be super insightful.

(Fixed that link. Email us and we'll send you some Pachyderm swag if you like.)

~~~
sauwan
I agree. I left the page immediately when it wasn't clear what the pricing
would look like. No use investing any of my time into something I don't know
if I can afford or get into my work budget.

~~~
sethammons
yup. I've seen "hidden" prices be as low as $100/yr to as much as $10k/mo. Now
I don't dig in deeper and just move along unless I'm desperate.

------
ryan_j_naughton
Awesome!

We switched from airflow to pachyderm recently, and we have loved it (we still
have some legacy on airflow, but all new data engineering dev is in
pachyderm).

~~~
jdoliner
Thanks for the kind words, Airflow -> Pachyderm has been an increasingly
common migration path for us.

------
jamesblonde
The data science platform looks great, but I have a fundamental issue with
building a data versioning platform using git-like semantics - Pachyderm and
DVC follow this approach. Git-like approaches just track immutable files -
they do not store the differences between files. That's fine when you're
working with small amounts of data, but it doesn't scale to large volumes of
data, and it prevents you from making some types of time-travel queries: give
me the data that changed in this time range. The emerging enterprise approach
is to use a framework like Delta.io, Apache Hudi, or Apache Iceberg. These
frameworks add metadata over Parquet and allow you to do both point-in-time
time-travel queries and interval-based time-travel queries (give me the data
that changed in this time interval). Parquet is becoming a dominant file
format for data lakes and is cheap to store on S3.
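The interval query described above is cheap precisely because it only touches metadata. A toy stand-in for the Delta/Hudi/Iceberg idea: a commit log that records when each commit happened and which data files it touched, so "what changed between t1 and t2" never scans the data itself. (Illustrative names only; the real formats keep far richer Parquet-level metadata.)

```python
from dataclasses import dataclass, field

@dataclass
class Commit:
    timestamp: int        # e.g. epoch seconds
    changed_files: list   # data files added or rewritten by this commit

@dataclass
class TableLog:
    """Toy metadata log over a table's data files."""
    commits: list = field(default_factory=list)

    def append(self, timestamp: int, changed_files: list) -> None:
        self.commits.append(Commit(timestamp, changed_files))

    def changed_between(self, start: int, end: int) -> set:
        """Files touched in (start, end]: an interval time-travel query
        answered from metadata alone."""
        out = set()
        for c in self.commits:
            if start < c.timestamp <= end:
                out.update(c.changed_files)
        return out
```

A system that only snapshots immutable file trees can reconstruct any single version, but it has to diff whole snapshots to answer this kind of interval question; the commit log makes it a metadata lookup.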

Disclosure: I am a co-founder of Logical Clocks.

------
anilshanbhag
Looking at this, it seems very similar to Verta.AI. Both are doing data
lineage/workflow management ~ git for data science workflows. Verta recently
raised their Series A and launched their platform too.

~~~
jdoliner
I'm not super familiar with Verta.AI but reading about their product it seems
pretty different. I read Verta.AI as a model deployment and tracking solution.
Seldon would be the open-source comparison that immediately comes to mind.
Pachyderm is more of a general pipeline and data version control system. You
can do some very basic model deployment and tracking in it, but we generally
recommend people use an external system such as Seldon, and we even have some
published examples of how to integrate the two.

Disclosure: I founded Pachyderm.

------
racl101
I wanted to install R to try it out and see if it was something I could work
with on macOS (Mojave).

It was nothing short of an f-ing nightmare.

Wonder if this service will help.

~~~
claytonjy
I don't think this solves that problem, or intends to. You can run R in
pachyderm via Docker, but that's unlikely to be a good way to test R out.

If you like Docker, try the images from Rocker: [https://www.rocker-
project.org/](https://www.rocker-project.org/). The one with RStudio server is
especially useful for local experimentation.

~~~
racl101
Thanks, I'll take a look at this.

------
culturestate
Quick note for @jdoliner: there's a typo on the homepage in the Useful Links
section - " _Technicial_ Slack Channel"

------
claytonjy
Can this run in a peered VPC connection or will I have to pay for data
transferred from my public cloud infrastructure to this service?

~~~
jdoliner
It can't right now; it all runs in GCP. You can still spin up Pachyderm
yourself in whatever environment you like. We're working on enabling custom
deploys in more environments than the one we have now, but it's still a little
ways out.

------
tspann
I often need Dataflow, Spark engineering, NoSQL and SQL datastores, enterprise
security, and hybrid cloud with my data science:
[https://docs.cloudera.com/machine-
learning/cloud/product/top...](https://docs.cloudera.com/machine-
learning/cloud/product/topics/ml-product-overview.html)

------
adamsvystun
Couldn't find any info on pricing. How much more expensive is this than
running your own cluster?

P.S. Couldn't login. Stuck on
[https://hub.pachyderm.com/pricing](https://hub.pachyderm.com/pricing). I am
just seeing a white screen. (Login with Google, Linux, Firefox, Ukraine)

