
Metaflow, Netflix's Python framework for data science, is now open source - vtuulos
https://metaflow.org
======
amirathi
After going through a lot of marketing fluff, I landed on this useful page
which explains Metaflow basics:
[https://docs.metaflow.org/metaflow/basics](https://docs.metaflow.org/metaflow/basics)

Here's my understanding:

\- It's a Python library for creating & executing DAGs

\- Each node is a processing step & the results are stored after each step, so
you can restart failed workflows from the point of failure

\- Tight integration with AWS ECS to run the whole DAG in the cloud

I don't know why, but their .org site oddly feels like a paid SaaS tool.
Anyway, thank you, Netflix, for open sourcing Metaflow.
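
The mechanics in the bullets above (a DAG of steps with results persisted after each one, so a failed run can resume) can be sketched in a few lines of plain Python. This is NOT the Metaflow API - the names below are invented purely to illustrate the idea:

```python
import json
import os
import tempfile

# Toy sketch: a linear DAG of steps whose results are persisted after
# every step, so a failed or interrupted run resumes from the last
# completed step instead of starting over.

CHECKPOINT = os.path.join(tempfile.mkdtemp(prefix="toy_flow_"), "state.json")

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"done": [], "data": {}}

def run(steps):
    state = load_state()
    for name, fn in steps:
        if name in state["done"]:
            continue                      # computed in an earlier run: skip
        state["data"] = fn(state["data"])
        state["done"].append(name)
        with open(CHECKPOINT, "w") as f:  # snapshot after every step
            json.dump(state, f)
    return state["data"]

steps = [
    ("start",  lambda d: {**d, "x": 2}),
    ("double", lambda d: {**d, "x": d["x"] * 2}),
    ("end",    lambda d: d),
]

print(run(steps)["x"])  # first run executes all three steps: 4
print(run(steps)["x"])  # second run skips all of them, same result: 4
```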

~~~
est
looks like airflow with ML tools?

~~~
thundergolfer
It provides a Python DAG building library like Airflow, but doesn't do
Airflow's 'Operator ecosystem' thing.

It is also very opinionated about dependency management (Conda-only) and is
Python-only, whereas Airflow, I think, has operators to run arbitrary
containers. So Metaflow is, I think, a non-starter if you don't want to use
Python exclusively.

Airflow also ships with built-in scheduler support (Celery?) or can run on
K8s. Metaflow doesn't have this. Seems to rely on AWS Batch for production DAG
execution.

Airflow ships with a pretty rich UI. Metaflow seems to be anti-UI, and instead
provides a novel notebook-oriented workflow interaction model.

Metaflow has pretty nice code artifact + params snapshotting functionality,
which is a core selling point. Airflow is not as supportive of this, so
reproducibility is harder (I think). This is encapsulated by their
"Datastore" model, which can persist flow code, config, and data locally or in
S3.
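
To make the snapshotting idea above concrete, here is a rough sketch of what a "content-addressed" datastore means: each artifact is stored under the hash of its bytes, so identical content is stored only once and any past snapshot can be fetched by its key. This is illustrative only, not Metaflow's actual datastore code:

```python
import hashlib
import os
import tempfile

# Toy content-addressed store: artifacts are keyed by the hash of
# their bytes, so re-storing identical content is a cheap no-op and
# any past snapshot can be looked up by key.

ROOT = tempfile.mkdtemp(prefix="cas_")

def put(blob: bytes) -> str:
    key = hashlib.sha1(blob).hexdigest()
    path = os.path.join(ROOT, key)
    if not os.path.exists(path):      # dedupe: same bytes, same address
        with open(path, "wb") as f:
            f.write(blob)
    return key

def get(key: str) -> bytes:
    with open(os.path.join(ROOT, key), "rb") as f:
        return f.read()

code_key = put(b"def train(): ...")    # snapshot the flow code
data_key = put(b"col_a,col_b\n1,2\n")  # snapshot a data artifact
assert get(code_key) == b"def train(): ..."
assert put(b"def train(): ...") == code_key  # re-storing is a no-op
```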

~~~
savin-goyal
Metaflow does come bundled with a scheduler that can place jobs on a variety
of compute platforms (the current release supports local on-instance execution
and AWS Batch). In terms of dependencies, we went with Conda because of its
traction in the data science community as well as its excellent support for
system packages. Our execution model also supports arbitrary Docker containers
(on AWS Batch), where you can theoretically bake in your own dependencies. In
terms of language support, we have bindings for R internally that we plan to
open source as well.

I wouldn't qualify Metaflow as anti-UI. For model monitoring, we haven't found
a good enough UI that can handle the diversity of models and use cases we see
internally, and we believe that notebooks are an excellent visualisation
medium that gives the end users (data scientists) the power to craft
dashboards as they see fit. For tracking the execution of production runs, we
have historically relied on the UI of the scheduler itself (Meson). We are
exploring what a Metaflow-specific UI might look like.

As for comparisons with Airflow, it is an excellent production-grade
scheduler. Metaflow intends to solve a different problem: providing an
excellent development and deployment experience for ML pipelines.

~~~
orbifold
Thanks for open sourcing this! This seems like precisely the kind of tool I‘ve
been looking for. One thing that would be really great is support for HPC
schedulers like SLURM. It seems relatively straightforward to add, so I might
give it a shot myself.

~~~
seeravikiran
Happy to help either through our gitter chat or help@metaflow.org.

------
Thorentis
How is this different from / better than existing tools or workflows? I don't
like to criticise new frameworks/tools without first understanding them, but I
like to know the key differences, without the marketing/PR fluff, before
giving one a go.

For instance, this tutorial example here
([https://github.com/Netflix/metaflow/blob/master/metaflow/tut...](https://github.com/Netflix/metaflow/blob/master/metaflow/tutorials/01-playlist/playlist.py))
does not look substantially different from what I could achieve just as easily
in R or in other Python data wrangling frameworks.

Is the main feature the fact I can quickly put my workflows into the cloud?

~~~
vtuulos
Here are some key features:

\- Metaflow snapshots your code, data, and dependencies automatically in a
content-addressed datastore, which is typically backed by S3, although local
filesystem is supported too. This allows you to resume workflows, reproduce
past results, and inspect anything about the workflow e.g. in a notebook. This
is a core feature of Metaflow.

\- Metaflow is designed to work well with a cloud backend. We support AWS
today but technically other clouds could be supported too. There's quite a bit
of engineering that has gone into building this integration. For instance,
using Metaflow's built-in S3 client, you can pull data at over 10 Gbps, which
is more than you can easily get with e.g. the AWS CLI today.

\- We have spent time and effort keeping the API surface area clean and highly
usable. YMMV, but it has been an appealing feature to many users so far.

Hope this makes sense!

~~~
navinsylvester
> using Metaflow's built-in S3 client, you can pull data at over 10 Gbps,
> which is more than you can easily get with e.g. the AWS CLI today

Can you please explain how you were able to beat the performance of the AWS
CLI?

~~~
vtuulos
You need more connections than a single AWS CLI process opens in order to
saturate the network on a big box. You can achieve the same with some
"xargs | aws cli" trickery, but then error handling becomes harder.

Our S3 client simply manages multiple worker processes correctly, with error
handling built in.
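
The approach can be sketched with a worker pool that keeps many connections open at once and collects per-key failures in one place. Note this sketch uses threads for simplicity, whereas the description above mentions worker processes, and `fetch` is an invented stand-in for a real S3 GET:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Fan a list of keys out over a pool of workers so many connections are
# open concurrently, and record failures per key instead of dying midway
# (the pain point of the "xargs | aws cli" approach).

def fetch(key: str) -> bytes:
    # Stand-in for a real S3 GET; fails for keys under "missing/".
    if key.startswith("missing/"):
        raise FileNotFoundError(key)
    return b"contents-of-" + key.encode()

def fetch_many(keys, workers=8):
    ok, failed = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, k): k for k in keys}
        for fut in as_completed(futures):
            key = futures[fut]
            try:
                ok[key] = fut.result()
            except Exception as exc:   # record the failure, keep going
                failed[key] = exc
    return ok, failed

ok, failed = fetch_many(["a/1", "a/2", "missing/3"])
print(sorted(ok))      # ['a/1', 'a/2']
print(sorted(failed))  # ['missing/3']
```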

------
vtuulos
Hey, I'm one of the authors of Metaflow. Happy to answer any questions!
Netflix has been using Metaflow internally for about two years, so we have
many war stories :)

~~~
bhtucker
The centralized DAG scheduler seems like a pretty important part. How much
will not having Meson hamper the usability?

~~~
pela
Hi bhtucker, we do have plans to integrate with production-grade schedulers in
the very near future.
[https://github.com/Netflix/metaflow/issues/2](https://github.com/Netflix/metaflow/issues/2)

------
aniketpanjwani
This looks exciting! I'll play around with the tutorial and try to set up the
AWS environment this weekend. I have several questions.

1\. At what sort of scale does Metaflow become useful? Would you expect
Metaflow to augment the productivity of a lone data scientist? Or is it more
likely that you would need 3, 10, 25, or more data scientists before Metaflow
becomes useful?

2\. When you move to a new text editor, there are some initial frictions while
you're trying to wrap your head around how things work. So, it can take some
time before you become productive. Analogously, I imagine there are initial
frictions when moving to Metaflow. In your experience, after Metaflow's
environment has already been established, how long does it take for data
scientists to get back to their initial productivity? It would be useful to
have a sense of this for the data scientist who would want to sell their
organization on adopting Metaflow.

3\. Many data scientists work in organizations which have far less mature data
infrastructure than Netflix, and/or data science needs of a much smaller scale
than Netflix. In particular, I may not even have batch processing needs (e.g.
a social scientist working on datasets which can be held entirely in memory).
In that case, is Metaflow useful?

4\. What's the closest open-source alternative to Metaflow on the market? Off
the top of my head, I can't think of anything which quite matches.

~~~
seeravikiran
1\. Metaflow helps most when there is an element of collaboration - so a small
to medium team of data scientists. Collaborating with yourself is another
scenario where Metaflow can be useful, since it takes care of versioning and
archiving various artifacts.

2\. Keeping the language Pythonic, without any additional need to learn a DSL,
has definitely been key to Metaflow's adoption internally. That said, this is
something we are open to hearing feedback on, esp. with this OSS launch.

3\. Yes - we definitely think so. Personally, my favorite part is the local
prototyping experience, when everything fits in memory and is blazing fast.
There is also an open issue for fast data access, which you can upvote if you
are interested in seeing it open-sourced.

4\. We don't think there is an exact equivalent either. :)

~~~
thundergolfer
Re 4, aren't Kubeflow and Lyft's recently open-sourced "Flyte" pretty similar?

If you don't consider them basically equivalent, what would you say are the
key differences?

~~~
seeravikiran
Thanks for pinging on this.

re: Kubeflow - imho it is quite coupled to Kubernetes. We don't intend to be
tied to a specific compute substrate, even though the first launch is with
AWS. We do follow a plugin architecture, so I'm hoping a Kube integration
happens sometime.

re: Flyte - I’m less informed on this but happy to educate myself and get
back.

~~~
thundergolfer
That's true of Kubeflow. I'm also not sure that project will be as keen on
being "compute substrate" agnostic as Metaflow is, given its connection with
Google.

If you feel inclined, jump into the Flyte Slack and share your thoughts :). At
my company we're on Kubeflow/Argo now, but things are developing quickly in
this space, so I'm keen not to be myopic.

~~~
seeravikiran
Thanks for sharing the context. Hopefully we can have a (fast) follow up with
Kube integration depending on demand.

------
Datenstrom
Is there a reason to use this over DVC[1], which is language- and
framework-agnostic and supports a large number of storage backends? It works
with any git repo, even polyglot implementations, and can run the DAG on any
system.

Currently using DVC, MLflow just for metadata visualization and notes on
experiments, and Anaconda for (python) dependency management. We are an
embedded shop so we don't deploy to the "cloud."

[1]: [https://dvc.org/](https://dvc.org/)

~~~
vtuulos
We are good friends with the DVC folks! If the DVC + MLFlow + Anaconda stack
works for you, that's great. Metaflow provides similar features. The cloud
integration is really important at Netflix's scale.

------
edparcell
My team has a similar library called Loman, which we open-sourced. Instead of
nodes representing tasks, they represent data, and the library keeps track of
which nodes are up to date or stale as you provide new inputs or change how
nodes are computed. Each node is either an input node with a provided value,
or a computed node with a function to calculate its value. Think of it as a
grown-up Excel calculation tree. We've found it quite useful for quant
research, and in production it works nicely because you can serialize the
entire computation graph, which gives an easy way to diagnose what failed and
why across hundreds of interdependent computations. It's also useful for
real-time displays, where you can bind market and UI inputs to nodes, and
calculated nodes back to the UI - some things you want to recalculate
frequently, whereas some are slow and should happen infrequently in the
background.

[1] Github:
[https://github.com/janushendersonassetallocation/loman](https://github.com/janushendersonassetallocation/loman)

[2] Docs:
[https://loman.readthedocs.io/en/latest/](https://loman.readthedocs.io/en/latest/)

[3] Examples:
[https://github.com/janushendersonassetallocation/loman/tree/...](https://github.com/janushendersonassetallocation/loman/tree/master/examples)

------
russfink
I am disappointed that when I click on documentation, "Why Metaflow," I get a
bunch of cartoony BS instead of a simple text explanation. Glad these folks
don't write RFCs.

Edit: just went to the Amazon CodeGuru homepage. Fantastic! Wish they were all
like that.

------
purple-again
We are on Azure, using Spark via Databricks. We had to abandon scikit-learn
because of this choice. Does your service require AWS, and can it be used in
conjunction with Spark? Thank you for your time and consideration.

~~~
missosoup
What about databricks made you abandon sklearn?

~~~
manojlds
MLLib I would think - [https://spark.apache.org/docs/latest/ml-
guide.html](https://spark.apache.org/docs/latest/ml-guide.html)

------
vtuulos
btw, if you happen to be at AWS re:Invent right now, you can get a stylish,
collector's edition Metaflow t-shirt if you drop by the Netflix booth in the
expo hall, or ping us otherwise!

------
cpintomammee
How does this compare to snakemake[1] and nextflow[2]?

[1]
[https://snakemake.readthedocs.io/en/stable/](https://snakemake.readthedocs.io/en/stable/)
[2] [https://www.nextflow.io/](https://www.nextflow.io/)

~~~
misterdoubt
The fact that metaflow works directly in Python piques my interest. I can lint
it, I can test it, I can format it, I can easily extend it.

I've been hesitant to commit myself and my collaborators to yet another DSL -
and that's part of why I haven't seen much on offer in Snakemake and Nextflow.

~~~
seeravikiran
Yes - that's our thinking too. Having the compiler catch typos in variable
names seems helpful for user productivity.

------
softwarelimits
Can anybody provide a good comparison, e.g. with Meltano?

I am not affiliated with the Meltano people, but I like the idea of keeping
the system modular, which seems to make it easier to replace components.

I have no doubt that we will see better replacements for every component of a
data pipeline in the coming years. If there is only one thing to do right,
it's to not bet on one tool but to keep the whole stack flexible.

I am still missing well-established standards for data formats, workflow
definitions, and project descriptions - hopefully open source ninjas will
deliver on this front before proprietary pirates destroy the field with
progress-inhibiting closed things. It seems too late to create an "AutoCAD" or
"Word" file format for data science, and I see no clear winner at the moment -
but hopefully my sight is just bad. Please enlighten me!

~~~
savin-goyal
I am not familiar with Meltano, sorry.

------
dj18
Seems like a cool addition to the DAG ML tooling family. Thanks for sharing!
Do you support, or plan to support, features commonly found in data science
platform tools like Domino
([https://www.dominodatalab.com/](https://www.dominodatalab.com/))? I'm
thinking of container management, automatic publishing of web apps and API
endpoints, providing a search for artifacts like code or projects, etc.

~~~
vtuulos
Good question. We have many common features covered:

\- Container management: See
[https://docs.metaflow.org/metaflow/dependencies](https://docs.metaflow.org/metaflow/dependencies)

\- Search for artifacts: see
[https://docs.metaflow.org/metaflow/client](https://docs.metaflow.org/metaflow/client)

\- Automatic publishing of web apps: we have this internally but it is not
open-source yet. If it interests you, react to this issue
([https://github.com/Netflix/metaflow/issues/3](https://github.com/Netflix/metaflow/issues/3))

Let us know if you notice any other interesting features missing! Feel free to
open a GitHub issue or reach out to us on
[http://chat.metaflow.org](http://chat.metaflow.org)

------
tristanz
This looks like a fantastically clean API for Python data and ML pipelines.
Congratulations!

It would be great to have a scheduler and monitoring UI that are equally
lightweight.

~~~
vtuulos
Metaflow comes with a built-in scheduler. If your company has an existing
scheduler, e.g. for ETL, you can translate Metaflow DAGs to the production
scheduler automatically. This is what we do at Netflix.

We could provide similar support e.g. for Airflow, if there's interest.

For monitoring, we have relied on notebooks so far. Their flexibility and
customizability are unbeatable. We might build some lightweight
discovery/debugging UI later if there's demand. We are a bit on the fence
about it internally.

~~~
thundergolfer
> For monitoring, we have relied on notebooks so far.

This is a pretty interesting approach. Notebooks are a natural environment for
data scientists, and for monitoring they'd provide really good
discoverability, flexibility, and interactivity.

Do you use more traditional monitoring that will alert you somehow if a
workflow fails, or if a workflow hasn't run at all for X hrs/days?

~~~
pela
At Netflix, we rely on the workflow scheduler for such alerting and bundle in
a layer of triggering mechanisms (custom notifications and such).

------
MostlyAmiable
The link in the docs to the CloudFormation template source is broken:
[https://docs.metaflow.org/metaflow-on-aws/deploy-to-
aws#clou...](https://docs.metaflow.org/metaflow-on-aws/deploy-to-
aws#cloudformation-template) Instead of /Netflix/metaflow-tools/aws it should
probably be /Netflix/metaflow-tools/tree/master/aws

~~~
seeravikiran
Thanks for reporting it. We'll fix it. Sorry for the inconvenience.

~~~
MostlyAmiable
No worries, on the whole the documentation is top notch.

------
posedge
Very interesting project. I love that this allows you to transparently switch
the "runtime" from local to cloud, like Spark does, but integrated with common
Python tools like sklearn/tf etc. Looking forward to testing Metaflow out
myself.

~~~
seeravikiran
Thanks. Let us know how you like the prototyping -> scaling out & up journey.

------
somurzakov
I looked over the tutorials and am curious to know whether they are
representative of how Netflix does ML.

Is data really being read in .csv format and processed in memory with pandas?

Because I see "petabytes of data" being thrown around everywhere, and I am
just trying to understand how one can read gigabytes of .csv and do simple
stats like group-bys in pandas - shouldn't a simple SQL DWH do the same thing
more efficiently with partitioned tables, clustered indexes, and the power of
the SQL language?

I would love to take a look at one representative ML pipeline (even with
masked names of datasets and features) just to see how "terabytes" of data get
processed into a model.

~~~
vtuulos
Good question! A typical Metaflow workflow at Netflix starts by reading data
from our data warehouse, either by executing a (Spark)SQL query or by fetching
Parquet files directly from S3 using the built-in S3 client. We have some
additional Python tooling to make this easy (see
[https://github.com/Netflix/metaflow/issues/4](https://github.com/Netflix/metaflow/issues/4))

After the data is loaded, there are a bunch of data transformation steps.
Training happens with an off-the-shelf ML library like scikit-learn or
TensorFlow. Many workflows train a suite of models using the foreach
construct.

The results can be pushed to various other systems. Typically they are either
pushed to another table or published as a microservice (see
[https://github.com/Netflix/metaflow/issues/3](https://github.com/Netflix/metaflow/issues/3))
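
The foreach fan-out/join pattern mentioned above can be sketched in plain Python: train one model per hyperparameter setting, then join and pick the best. (In Metaflow the fan-out is written with `self.next(..., foreach=...)`; the data and the "model" below are invented for illustration.)

```python
# Fan out over a hyperparameter grid, then join and select the winner.

data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8)]  # (x, y) pairs

def train(slope):
    # The "model" is just y = slope * x; score it by squared error.
    err = sum((y - slope * x) ** 2 for x, y in data)
    return {"slope": slope, "error": err}

grid = [0.5, 0.9, 1.0, 1.5]                   # the foreach values
models = [train(s) for s in grid]             # one branch per value
best = min(models, key=lambda m: m["error"])  # the join step
print(best["slope"])                          # 1.0
```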

------
bitfhacker
It's so simple and intuitive to run two steps in parallel. Thank you, Netflix!

~~~
savin-goyal
You’re welcome! :)

------
ZenPsycho
This seems very similar to [http://metaflow.fr](http://metaflow.fr) - is there
any relation, or is this just a name collision?

~~~
vtuulos
no relation

------
manojlds
How does it compare to dagster.io?

[https://github.com/dagster-io/dagster](https://github.com/dagster-io/dagster)

~~~
vtuulos
Orchestrating a workflow, which is what Dagster does, is just one part of
Metaflow. Other important parts are dependency management, cloud integration,
state transfer, and inspecting and organizing results - features that are
central to data science workflows.

Metaflow helps data scientists build and manage data science workflows, not
just execute a DAG.

------
elwell
At first glance, I see BASIC's GOTO statement.

------
firedup
How does this compare to Kedro?

------
sriharshams
Awesome!!!

