
Show HN: Orchest – Data Science Pipelines - ricklamers
Hello Hacker News! We are Rick & Yannick from Orchest (https://www.orchest.io - https://github.com/orchest/orchest). We're building a visual pipeline tool for data scientists. The tool can be considered high-code because you write your own Python/R notebooks and scripts, but we manage the underlying infrastructure to make it 'just work™'. You can think of it as a simplified version of Kubeflow.

We created Orchest to free data scientists from the tedious engineering-related tasks of their job, similar to how companies like Netflix, Uber and Booking.com support their data scientists with internal tooling and frameworks to increase productivity. When we worked as data scientists ourselves, we noticed how heavily we had to depend on our software engineering skills for all kinds of tasks, from configuring cloud instances for distributed training to optimizing networking and storage for processing large amounts of data. We believe data scientists should be able to focus on the data and the domain-specific challenges.

Today we are just at the very beginning of making better tooling available for data science, and we are launching our GitHub project that gives enhanced pipelining abilities to data scientists using the PyData/R stack, with deep integration of Jupyter Notebooks.

Currently Orchest supports:

1) visually and interactively editing a pipeline that is represented using a simple JSON schema (see the sketch at the end of this post);

2) running remote container-based kernels through the Jupyter Enterprise Gateway integration;

3) scheduling experiments by launching parameterized pipelines on top of our Celery task scheduler;

4) configuring local and remote data sources to separate code versioning from the data passing through your pipelines.

We are here to learn and get feedback from the community. As youngsters we don't have all the answers and are always looking to improve.
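To give a feel for how lightweight the pipeline definition is, here is an illustrative sketch of the kind of JSON we generate; the field names are simplified, the actual schema lives in the repo:

    import json

    # Simplified, illustrative pipeline definition: each step points at a
    # notebook or script and declares which steps it consumes output from.
    pipeline = {
        "name": "example-pipeline",
        "steps": [
            {"uuid": "load", "file_path": "load.ipynb", "incoming_connections": []},
            {"uuid": "clean", "file_path": "clean.py", "incoming_connections": ["load"]},
            {"uuid": "report", "file_path": "report.ipynb", "incoming_connections": ["clean"]},
        ],
    }

    print(json.dumps(pipeline, indent=2))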
======
ellisv
> We're building a visual pipeline tool for data scientists.

As a Sr. DS/ML Engineer, this doesn't speak to me.

~~~
jpau
I personally agree that this does not seem too useful for any DS team that
needs to deploy a model to production. But there are a whole horde of DS teams
whose outputs are basically PPT presentations.

Think pricing forecasts, decision modelling, marketing segmentation for
product design, ...

I think a common thread to these teams -- at least those that I've seen -- is
that they consist of stats->DS backgrounds, and no eng->DS backgrounds.

Many of these teams are orchestrating everything within the notebook. I've
seen notebooks that contain complex workflows that extend to 10K LOC. I've
lost sleep over such things.

~~~
FridgeSeal
I agree, and so many data science and data engineering tools all seem to
revolve around using notebooks, much to my frustration. I’ve worked in places
whose data pipelines were built around seemingly infinite notebooks, all
containing consistently poor software engineering.

It’s been enough to make me vow to not let people write notebooks that go into
prod under my watch lol.

I’m constantly on the lookout for software-engineering-focused tools that
solve these issues, rather than data science/engineering focused tools. So
many are also inextricably tied to Python, which doesn’t gel nicely with
codebases that span multiple languages.

~~~
calebkaiser
Shameless plug, but I help maintain Cortex, and "software engineering focused
tools that solve (ML) issues" is a neat summary of our entire philosophy. For
example, instead of notebooks, our model serving platform
(https://github.com/cortexlabs/cortex) uses YAML to structure deployments and
Python scripts to write inference APIs.

It's still inextricably linked to Python, but only for writing your API. It's
agnostic as to how the model itself is developed, so long as it can generate
predictions.
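To give a feel for that split: the deployment itself is declared in a separate YAML file, and the API is a plain Python class. A minimal sketch, assuming a pickled scikit-learn-style model (check the repo for the current interface):

    # Minimal sketch of a Cortex Python predictor. Assumes the model was
    # pickled beforehand; swap in whatever framework your project uses.
    import pickle

    class PythonPredictor:
        def __init__(self, config):
            # Runs once per API replica at startup: load the trained model.
            with open(config["model_path"], "rb") as f:
                self.model = pickle.load(f)

        def predict(self, payload):
            # Called per request with the parsed JSON request body.
            return self.model.predict([payload["features"]]).tolist()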

------
ishcheklein
Reminds me a bit of https://plynx.com/, which is also open source. Is there a
major differentiator I'm missing? Also, what is your idea regarding the use
case? Why would I need to run it locally, for example? Is it mostly about
productionizing ML?

~~~
jpau
Similarly, there seems to be partial overlap with MLFlow for tracking
iterations.

I would find a comparison table vs. existing tools useful, to help me consider
Orchest by placing it in my existing workflow.

~~~
ricklamers
We want to make it easier for people to understand the landscape of tools and
our position within it.

I personally like something like GitLab's https://about.gitlab.com/devops-tools/.
We'll try to put up something similar on our website at some point.

I'm not deeply familiar with MLFlow, but from what I have seen/read it is more
of a tracking framework that you integrate into an existing codebase.

Orchest, on the other hand, lets you take your existing codebase and structure
it into a pipeline, giving you a visual and containerized way of interacting
with that codebase (allowing a mix of notebooks and .py/.sh/.R scripts),
running the pipeline, and visually inspecting the success/failure of pipeline
runs/steps.

Another key point of difference is that we are more concerned with managing
the flow of data. Since we let you build pipelines, we can give you
abstractions that separate data flow from pipeline code, e.g. letting you
define generic pipelines that take any data source (in some standard form,
like a schema'd database) and produce reports. Because we control the data
source in relation to the containerized pipelines, we can also make sure the
whole thing performs well when executed in parallel (e.g. the same version of
the pipeline running a grid search over parameterized pipelines; see the
sketch below). In other words, we also control more of the underlying
infrastructure when executing the pipelines.
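To make the data-flow abstraction concrete, here is a rough sketch of what a single pipeline step can look like with the SDK. Treat the names as illustrative; the orchest-sdk repo is authoritative for the actual API:

    # Illustrative sketch of a pipeline step using the Orchest SDK; the
    # function names approximate the SDK, check the repo for the real API.
    import pandas as pd
    import orchest.transfer as transfer

    raw = transfer.get_inputs()[0]        # data produced by the upstream step
    cleaned = pd.DataFrame(raw).dropna()  # your actual data science code
    transfer.output(cleaned)              # becomes the input of downstream steps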

------
vasinov
This looks cool! A couple of questions:

1. Currently, if I install something in the notebook, does it get re-installed
every time the pipeline is run? Is there any way to "snapshot" the state of
the container?

2. Where is the data stored between the steps?

3. How well-integrated is it with AWS cloud primitives such as EC2 instances,
EFS, and S3?

~~~
ricklamers
Thanks!

1. Right now, additional dependencies for the container need to be
re-installed whenever you run the pipeline. For the duration of a Jupyter
kernel session, though, the container state, and thus any installed
dependencies, remains available. We're working on supporting either container
snapshots or custom container images (with the desired dependencies
pre-installed). We'll likely go with snapshots as they'll be easier from an
end-user perspective.

2. During step execution, data is stored in either the pipeline directory
(which contains, for example, the .ipynb/.py/.R/.sh files) or any of the
mounted directories (through data sources).

When you run the pipeline as part of an experiment, a copy is created so that
any state the steps generate inside the pipeline directory is isolated from
the 'working copy' of the pipeline.

Edit: forgot to mention that we support memory-based data transfer between
steps which is faster and doesn't "pollute" your pipeline directory. It does
require your data to fit in memory though. We use Apache Arrow's Plasma for
this.
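For the curious, the underlying Plasma pattern looks roughly like this; a generic pyarrow example, not our internal code:

    # Generic pyarrow Plasma usage (not Orchest internals). Requires a
    # running plasma store, e.g.: plasma_store -m 1000000000 -s /tmp/plasma
    import pyarrow.plasma as plasma

    client = plasma.connect("/tmp/plasma")
    object_id = client.put({"rows": [1, 2, 3]})  # producer step stores an object
    value = client.get(object_id)                # consumer step fetches it by ID
    print(value)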

3. AWS S3 and AWS Redshift are currently supported as data sources. There are
some light docs at
https://orchest-sdk.readthedocs.io/en/latest/python.html#datasources (to be
improved!) and the relevant SDK source at
https://github.com/orchest/orchest-sdk/blob/master/python/orchest/datasources.py.
We should look into EFS. Do you have a use case in mind?
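As a rough illustration of the intended ergonomics; every name below is hypothetical, the linked docs and SDK source are authoritative:

    # Hypothetical sketch: a step reads from an S3 data source configured
    # in the Orchest UI instead of hard-coding credentials in the notebook.
    # Both `get` and `read_csv` are illustrative stand-ins, not the real API.
    from orchest import datasources

    s3 = datasources.get("customer-data")    # data source name set in the UI
    df = s3.read_csv("events/2020-06.csv")   # hypothetical convenience method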

------
pplonski86
Congratulations! I remember your earlier project, Grid Studio. Do you support
scheduling periodic tasks? Do you support execution triggered by a webhook? Or
some way to expose a notebook as a REST API?

~~~
yannickperrenet
Grid Studio is indeed the project Rick worked on before starting to work on
Orchest. Great to hear that you are familiar with it.

Currently, tasks can only be scheduled to run at a set time. Could you
elaborate a little on why you would want tasks to run periodically? We have
some ideas on why this might be helpful, but would love to hear your take on
this. (Periodic task scheduling is absolutely something we can add.)

The front-end of the application actually makes calls to our API for multiple
things, among which the execution of tasks (our internal endpoints can be
found at
https://github.com/orchest/orchest/tree/master/orchest/orchest-api/app/app/apis).
Exposing a (REST) API that lets users interact with the pipeline and start
executions is on our roadmap.

I hope this answers your questions.
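Since we already run experiments on Celery, periodic scheduling would likely build on its beat scheduler. A minimal generic sketch (plain Celery, not Orchest code):

    # Generic Celery beat example, not Orchest code: run a pipeline task
    # every hour. Requires a message broker, e.g. Redis or RabbitMQ.
    from celery import Celery
    from celery.schedules import crontab

    app = Celery("scheduler", broker="redis://localhost:6379/0")

    @app.task
    def run_pipeline(pipeline_uuid):
        print(f"launching pipeline {pipeline_uuid}")  # stand-in for a real run

    app.conf.beat_schedule = {
        "hourly-pipeline": {
            "task": "scheduler.run_pipeline",
            "schedule": crontab(minute=0),  # at the top of every hour
            "args": ("my-pipeline-uuid",),
        },
    }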

~~~
pplonski86
Scheduling periodic tasks can be useful for creating ETLs or alarms (e.g.
every hour, check the data for new values; if a condition is met, send an
email alert).

~~~
yannickperrenet
Agreed! ;)

------
Obinkhorst
Thanks for sharing, this is super helpful. I'm endlessly jealous of the teams
at Uber and Booking and their fancy tools.

~~~
ricklamers
We'll make sure the rest of the world gets those great tools too!

------
rgmvisser
Really cool! I can’t wait to start playing with it.

Can two people collaborate on the same project at the same time?

~~~
ricklamers
That's great to hear! Right now, editing the same pipeline at the same time
isn't fully supported. We're moving towards a git-based async collaboration
approach where you can fork and merge pipelines, to make sure changes to
code/notebooks aren't going to surprise you in your analysis/models.

------
abalaji
How do you think about this compared to something like Dataiku?

~~~
ricklamers
Great question. We think a key point of difference is that we'll never focus
on providing a 'clicky' way of building the actual data processing, training,
and transformation steps.

For example, in Dataiku you can define and transform columns using a GUI. We
never saw that as more productive than writing transformations with, e.g., R
data frames, Pandas, or Koalas (Pandas on Spark). The Python/R scripts that do
the actual transformation can be versioned, re-used, and modified much more
cleanly than anything produced with GUI-based transformation/processing.

You also don't ask people to invest their time and skills in a way of doing
things that is specific to a particular tool: if you can write great data
transforms in Pandas today, you don't have to change anything when you start
using Orchest to build your data pipelines.
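To illustrate with a toy example: a transform like the one below is plain Pandas, so it can live in version control, be unit-tested, and run unchanged as an Orchest pipeline step (column names are made up):

    # A plain Pandas transformation with nothing Orchest-specific in it,
    # so it can be versioned, tested, and reused outside the pipeline too.
    import pandas as pd

    def add_revenue_per_user(df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        out["revenue_per_user"] = out["revenue"] / out["users"].clip(lower=1)
        return out

    if __name__ == "__main__":
        sample = pd.DataFrame({"revenue": [100.0, 250.0], "users": [10, 25]})
        print(add_revenue_per_user(sample))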

------
xiaodai
I have wanted something like this.

Julia support?

~~~
ricklamers
Yes please! It's on our roadmap. We want to be as language-agnostic as
possible, close to the spirit of Jupyter.

We already have a lot in place to make it easy to add languages, so if more
people want it, we'll have Julia support coming relatively soon.

