Metaflow, Netflix's Python framework for data science, is now open source (metaflow.org)
498 points by vtuulos 1 day ago | 109 comments

After going through a lot of marketing fluff, I landed on this useful page which explains Metaflow basics: https://docs.metaflow.org/metaflow/basics

Here's my understanding:

- It's a python library for creating & executing DAGs

- Each node is a processing step & the results are stored after each step so you can restart failed workflows from where they failed

- Tight integration with AWS ECS to run the whole DAG on cloud
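To make the "results are stored after each step" point concrete, here is a minimal stdlib-only sketch of the idea. It is not Metaflow's API or implementation: `run_steps`, the linear list of steps (a real DAG can branch), and the one-pickle-file-per-step layout are all assumptions made for illustration.

```python
import pickle
from pathlib import Path


def run_steps(steps, store_dir):
    """Run (name, func) pairs in order, persisting each result with pickle.

    A step whose result file already exists is skipped, so a rerun
    resumes from the first step that has no stored result.
    Returns the final value and the names of the steps actually executed.
    """
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    value = None
    executed = []
    for name, func in steps:
        path = store / f"{name}.pkl"
        if path.exists():
            # Reuse the stored result instead of recomputing the step.
            value = pickle.loads(path.read_bytes())
            continue
        value = func(value)  # each step receives the previous step's result
        path.write_bytes(pickle.dumps(value))
        executed.append(name)
    return value, executed
```

If a later step raises, the earlier results stay on disk, so calling `run_steps` again with the same `store_dir` only executes the failed step and whatever follows it.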

I don't know why their .org site oddly feels like a paid SaaS tool. Anyway, thank you Netflix for open sourcing Metaflow.

I would also add - dependency management (certain degree of reproducibility) as a first class feature leveraging conda.

So it's like Snakemake? [0] Snakemake can also control k8s clusters. Perhaps they are easier to set up with Metaflow?

[0] https://snakemake.readthedocs.io/en/stable/

Addendum - You can mix and match what steps of the DAG run on the cloud.

looks like airflow with ML tools?

It provides a Python DAG building library like Airflow, but doesn't do Airflow's 'Operator ecosystem' thing.

It also is very opinionated about dependency management (Conda-only) and is Python-only, where Airflow I think has operators to run arbitrary containers. So Metaflow is a non-starter I think if you don't want to exclusively use Python.

Airflow also ships with built-in scheduler support (Celery?) or can run on K8s. Metaflow doesn't have this. Seems to rely on AWS Batch for production DAG execution.

Airflow ships with a pretty rich UI. Metaflow seems to be anti-UI, and provides a novel Notebook-oriented workflow interaction model.

Metaflow has pretty nice code artifact + params snapshotting functionality which is a core selling point. Airflow is not as supportive of this so it's harder to do reproducibility (I think). This is encapsulated by their "Datastore" model which can locally or in S3 persist flow code, config and data.

Metaflow does come bundled with a scheduler that can place jobs on a variety of compute platforms (current release supports local on-instance and AWS batch). In terms of dependencies, we went with conda because of its traction in the data science community as well as excellent support for system packages. Our execution model also supports arbitrary docker containers (on AWS batch) where you can theoretically bake in your own dependencies. In terms of language support, we have bindings for R internally, that we plan to open source as well.

I wouldn’t qualify metaflow as anti-UI. For model monitoring, we haven’t found a good enough UI that can handle the diversity of models and use cases we see internally, and believe that notebooks are an excellent visualisation medium that gives the power to the end user (data scientists) to craft dashboards as they see fit. For tracking the execution of production runs, we have historically relied on the UI of the scheduler itself (meson). We are exploring what a metaflow-specific UI might look like.

As for comparisons with Airflow, it is an excellent production grade scheduler. Metaflow intends to solve a different problem of providing an excellent development and deployment experience for ML pipelines.

Thanks for these clarifications.

> Our execution model also supports arbitrary docker containers (on AWS batch) where you can theoretically bake in your own dependencies.

That's fair, but it doesn't seem to be something encouraged by the framework, and that's fine.

> I wouldn’t qualify metaflow as anti-UI.

Maybe anti-UI is too strong yeah. I personally think your approach could be great. Looking forward to exploring it.

Thanks for open sourcing this! This seems like precisely the kind of tool I‘ve been looking for. One thing that would be really great is support for HPC schedulers like SLURM. It seems relatively straightforward to add, so I might give it a shot myself.

Happy to help either through our gitter chat or help@metaflow.org.

I guess (?) - minus the input spec being not YAML but more language native (pythonic for e.g.)

How is this different / better to existing tools or workflows? I don't like to criticise new frameworks / tools without first understanding them, but I like to know what some key differences are without the marketing/PR fluff before giving one a go.

For instance, this tutorial example here (https://github.com/Netflix/metaflow/blob/master/metaflow/tut...) does not look substantially different to what I could achieve just as easily in R, or other Python data wrangling frameworks.

Is the main feature the fact I can quickly put my workflows into the cloud?

Here are some key features:

- Metaflow snapshots your code, data, and dependencies automatically in a content-addressed datastore, which is typically backed by S3, although local filesystem is supported too. This allows you to resume workflows, reproduce past results, and inspect anything about the workflow e.g. in a notebook. This is a core feature of Metaflow.

- Metaflow is designed to work well with a cloud backend. We support AWS today but technically other clouds could be supported too. There's quite a bit of engineering that has gone into building this integration. For instance, using the Metaflow's built-in S3 client, you can pull over 10Gbps, which is more than you can get with e.g. aws CLI today easily.

- We have spent time and effort in keeping the API surface area clean and highly usable. YMMV but it has been an appealing feature to many users this far.
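For readers wondering what "content-addressed" means in the first bullet, here is a small sketch of the general technique, assuming a local directory as the backing store. `ContentStore` and its methods are hypothetical names; this is not Metaflow's datastore code, and the choice of SHA-256 over pickled bytes is an assumption.

```python
import hashlib
import pickle
from pathlib import Path


class ContentStore:
    """Store objects under a key derived from their serialized content."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, obj):
        blob = pickle.dumps(obj)
        key = hashlib.sha256(blob).hexdigest()
        path = self.root / key
        if not path.exists():
            # Identical content maps to the same key, so it is written once.
            path.write_bytes(blob)
        return key

    def get(self, key):
        return pickle.loads((self.root / key).read_bytes())
```

Because the key is a hash of the content, a snapshot of a run only has to record keys, and unchanged artifacts cost no extra storage between runs.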

Hope this makes sense!

Your first bullet point should be highlighted on your project front page! I read over the page a couple of times, and couldn’t deduce that, and I think it’s a really enticing feature!

Thanks. Feedback noted.

Can you compare and contrast with tools such as dask, dask-kubernetes, Prefect[1]?

[1] https://www.prefect.io/products/core

Dask is great if you want to distribute your algorithms / data processing at a granular level. Metaflow is a bit more "meta" in a sense that we take your Python function as-is, which may use e.g. Tensorflow and PyTorch, and we execute it as an atomic unit on a container. This is useful, since you can use any existing ML libraries.

I am not familiar with Prefect.

Correct me if I'm wrong on the following, as I have not yet used Metaflow (only read the docs):

It is conceivable to execute a Flow entirely locally yet achieve @step-wise "Batch-like" parallelism/distributed computation by, in the relevant @step's, `import dask` and use it as you would outside of Metaflow, correct?

Although, as I think of it, the `parallel_map` function would achieve much of what Dask offers on a single box, wouldn't it? But within a @step, using dask-distributed could kinda replicate something a little more akin to the AWS integration?

Tangentially related, but the docs note that data checkpointing is achieved using pickle. I've never compared them, but I've found parquet files to be extremely performant for pandas dataframes. Again, I'm assuming a lot here, but I'd expect @step results to be dataframes quite often. What was the design consideration associated with how certain kinds of objects get checkpointed?

To be clear, the fundamental motivation behind these lines of questioning is: how can I leverage Metaflow in conjunction with existing python-based _on-prem_ distributed (or parallel) computing utilities, e.g. Dask? In other words, can I expect to leverage Metaflow to execute batch ETL or model-building jobs that require distributed compute that isn't owned by $cloud_provider?

As an aside: hot damn, the API - https://i.kym-cdn.com/photos/images/newsfeed/000/591/928/94f... Really cool product. Thanks for all the effort poured into this by you and others.

Yes - you should be able to use dask the way you say.

Your first part of the understanding matches my expectation too. Dask single box parallelism achieved by multi processing - akin to parallel map. And distributed compute is achieved by shipping the work to remote substrates.

For your second comment - we leverage pickle mostly to keep it easy and simple for basic types. For small dataframes we just pickle for simplicity. For larger dataframes we rely on users to directly store the data (probably encoded as parquet) and just pickle the path instead of the whole dataframe.
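The "pickle the path instead of the whole dataframe" pattern could look roughly like the sketch below. To stay stdlib-only it uses a list of rows and a CSV file as stand-ins for a dataframe and a parquet file; `save_artifact`, `load_artifact` and the `max_inline_rows` threshold are assumptions for illustration, not anything Metaflow provides.

```python
import csv
import pickle
from pathlib import Path


def save_artifact(rows, workdir, max_inline_rows=1000):
    """Pickle small tables directly; for large ones, pickle only a file path."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    if len(rows) <= max_inline_rows:
        payload = {"kind": "inline", "rows": rows}
    else:
        data_path = workdir / "table.csv"  # stand-in for a parquet file
        with data_path.open("w", newline="") as f:
            csv.writer(f).writerows(rows)
        payload = {"kind": "path", "path": str(data_path)}
    return pickle.dumps(payload)


def load_artifact(blob):
    """Return the rows, reading them from disk if only a path was pickled."""
    payload = pickle.loads(blob)
    if payload["kind"] == "inline":
        return payload["rows"]
    with open(payload["path"], newline="") as f:
        return [row for row in csv.reader(f)]
```

Note that CSV round-trips every cell as a string, which is one reason a typed columnar format like parquet is the more likely choice in practice.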

Yes, you can use any python library inside a metaflow step.

Prefect is built by the Airflow core devs after they took their initial learnings and built something new. It's a reasonable orchestration engine.

Our hope with metaflow is to make the transition to production schedulers like Airflow (and perhaps similar technologies) seamless once you write the DAG via the FlowSpec. The user doesn’t have to care about the conversion to YAML etc. So I would say metaflow works in tandem with existing schedulers.

This is very interesting as a goal. Are you saying metaflow is a "Jupyter notebook for airflow developers" kind of a thing ?

I wouldn't exactly say that. Jupyter notebooks don't have an easy way to represent an arbitrary DAG. The flow is more linear and narrative like. That said, we do expect metaflow (with client API) to play very well with notebooks to support a narrative from a business use-case pov; which might be the end-goal of most ML workloads (hopefully). I would like to think of metaflow, as your workflow construct - hopefully making your life simpler with ML workloads when involving interactions with existing pieces of infrastructure (infra pieces - storage, compute, notebooks or other UI, http service hosting etc.; concepts - collaboration, versioning, archiving, dependency management)

sorry- i meant notebook figuratively.

so metaflow is the local dev version of workflow construct. and then export that to airflow/etc compatible format ?

what workflow engine do you guys use and primarily support in metaflow ?

the workflow DAG is a core concept of Metaflow but Metaflow is not just that. Other core parts are managing and inspecting state, cloud integration, and organizing results.

At Netflix, we use an internal workflow engine called Meson https://www.youtube.com/watch?v=0R58_tx7azY

so happy to learn about it. i have used airflow in the past and it seems they have addressed various pain points with this new library.

> It's a reasonable orchestration engine.

Could you elaborate, or point me at any reviews of their product? It's closed-source so much harder to learn about without learning from people that have paid for it.

Sorry to be snarky, but after all this great job, I still have to put the db password in clear text, as an environment variable...

Why not use secrets manager for this? It can even rotate the secret with not much headache.

/edit: I could have a wrapper script that reads the secret and then os.execve()...
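A rough sketch of that wrapper idea, for anyone who wants to try it. `fetch_secret` is a hypothetical callable you supply (for AWS it could wrap boto3's Secrets Manager `get_secret_value` call), and the `DB_PASSWORD` variable name is an assumption, not something Metaflow defines.

```python
import os


def build_env(base_env, secret_name, fetch_secret, var_name="DB_PASSWORD"):
    """Return a copy of base_env with the fetched secret added under var_name."""
    env = dict(base_env)  # copy, so the caller's environment is not mutated
    env[var_name] = fetch_secret(secret_name)
    return env


def exec_with_secret(argv, secret_name, fetch_secret):
    """Replace the current process with argv, passing the secret via its env.

    argv[0] must be the full path to an executable: os.execve does not
    search PATH, and it does not return on success.
    """
    env = build_env(os.environ, secret_name, fetch_secret)
    os.execve(argv[0], argv, env)
```

Something like `exec_with_secret([sys.executable, "myflow.py", "run"], "prod/db", fetch_secret)` would then start the flow. The password still lives in the child's environment, but it no longer has to sit in a shell profile or config file.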

Good point. We will address it.

> using the Metaflow's built-in S3 client, you can pull over 10Gbps, which is more than you can get with e.g. aws CLI today easily

Can you please explain how you were able to better the performance of aws cli.

You need more connections than a single AWS CLI process opens to saturate the network on a big box. You can achieve the same by doing some "xargs | aws cli" trickery but then error handling becomes harder.

Our S3 client just handles multiple worker processes correctly with error handling.
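The general shape of "many concurrent fetches plus per-object error handling" can be sketched as follows. This is not Metaflow's S3 client: it uses threads rather than worker processes to stay self-contained, and `fetch_all`, `fetch_one` and the retry count are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(keys, fetch_one, workers=16, retries=2):
    """Fetch many objects concurrently.

    Returns (results, errors): results maps key -> fetched value, errors maps
    key -> the last exception for keys that failed every attempt.
    """
    def attempt(key):
        last_error = None
        for _ in range(retries + 1):
            try:
                return fetch_one(key)
            except Exception as exc:  # retry on any failure of this one key
                last_error = exc
        raise last_error

    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(attempt, key): key for key in keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:
                errors[key] = exc  # one bad key does not abort the others
    return results, errors
```

The point of returning errors per key is exactly what gets awkward with the xargs approach: you know which objects failed and why, without parsing exit codes from many subprocesses.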

What does it mean to be able to pull over 10Gbps? With one object or many objects in the same prefix?

With many objects under the same S3 bucket - say for a flow or a run (with many tasks).

This looks exciting! I'll play around with the tutorial and try to set up the AWS environment this weekend. I have several questions.

1. At what sort of scale does Metaflow become useful? Would you expect Metaflow to augment the productivity of a lone data scientist working by himself? Or is it more likely that you would need 3, 10, 25, or more data scientists before Metaflow is likely to become useful?

2. When you move to a new text editor, there are some initial frictions while you're trying to wrap your head around how things work. So, it can take some time before you become productive. Analogously, I imagine there are initial frictions when moving to Metaflow. In your experience, after Metaflow's environment has already been established, how long does it take for data scientists to get back to their initial productivity? It would be useful to have a sense of this for the data scientist who would want to sell their organization on adopting Metaflow.

3. Many data scientists work in organizations which have far less mature data infrastructure than Netflix, and/or data science needs of a much smaller scale than Netflix. In particular, I may not even have batch processing needs (e.g. a social scientist working on datasets which can be held entirely in memory). In that case, is Metaflow useful?

4. What's the closest open-source alternative to Metaflow on the market? Off the top of my head, I can't think of anything which quite matches.

1. Metaflow should best help when there is an element of collaboration - so small to medium team of data scientists. Collaborating with your self is also another scenario when Metaflow can be useful since it takes care of versioning and archiving various artifacts.

2. Keeping the language pythonic, without any additional need to learn a DSL has definitely been key to Metaflow's adoption internally. That said, this is something we are open to hearing feedback on, esp. with this OSS launch.

3. Yes - definitely think so. Personally my favorite is the local prototyping experience part; when everything can fit in memory and is blazing fast. There is also an open issue for fast-data access, which you can upvote if interested in seeing it open-sourced.

4. We don't think there is an exact equivalent either. :)

Can you specifically compare Metaflow to DVC and Databricks MLFlow? Those seem to be some popular tools in this space right now?

Re 4, aren't Kubeflow and Lyft's recently open-sourced "Flyte" pretty similar?

If you don't consider them basically equivalent, what would you say are the key differences?

Thanks for pinging on this.

re: Kubeflow - imho it is quite coupled to Kubernetes. We don’t intend to be tied to a specific compute substrate even though the first launch is with AWS. We do follow a plugin architecture - so I’m hoping Kube happens sometime.

re: Flyte - I’m less informed on this but happy to educate myself and get back.

Good overview of Flyte found here. https://www.youtube.com/watch?v=KdUJGSP1h9U It does appear to be quite similar, though it has native k8s integration and a central web-based UI for monitoring jobs. Flyte asks the user to turn on caching. I like that Metaflow does that for you by default.

That's true of Kubeflow. I'm not sure that project will be as keen on being as "compute substrate" agnostic as Metaflow too, given its connection with Google.

If you feel inclined jump in the Flyte Slack and share your thoughts :). At my company we're on Kubeflow/Argo now, but things are developing quite a lot in this space so keen to not be myopic.

Thanks for sharing the context. Hopefully we can have a (fast) follow up with Kube integration depending on demand.

re: 3. We have an optimized S3 client as part of this release - https://docs.metaflow.org/metaflow/data#data-in-s-3-metaflow...

hey, I'm one of the authors of Metaflow. Happy to answer any questions! Netflix has been using Metaflow internally for about two years, so we have many war stories :)

Hi Ville, I cannot express how happy I am that you open-sourced Metaflow. I have three questions: 1) Do you know the ETA of the R API release? 2) Would you recommend using both Metaflow and MLFlow in projects? Could you please explain why or why not :) 3) Do you plan to release integration with Spark/Yarn?

Thanks in advance, Roko

Thanks for open sourcing this.

Can you say a little about which niche this would occupy, and what the motivation is? Is it intended to compete with Tensorflow and Pytorch or to be an industrial strength version of SKlearn?

I looked through the tutorial on my mobile and the answer was not immediately clear.

Is the benefit that it auto scales on AWS without having to think through the infrastructure?

Hi omarhaneef, We don't intend to compete with Tensorflow, PyTorch, SKLearn. What we offer is a way to iterate and productionize your models written using any of the aforementioned libraries (and more). https://docs.metaflow.org/introduction/what-is-metaflow contains further elaboration of our philosophy. Auto-scaling infrastructure is one piece of the puzzle, and Metaflow goes beyond that offering a comprehensive solution for model management.

Hey I meant to track one of y'all down at the MLOps conference, but didn't get the chance. I've built a very shitty version of a cached-execution DAG thing internally, and one of the design decisions I made was to have it so that parent nodes in the DAG don't need to know anything about child nodes. This allows for larger DAG builders to be more easily subclassed.

MetaFlow doesn't do that -- instead each 'step' has to know what to call next, which means that if I wanted to subclass e.g. the MovieStatsFlow in [here](https://github.com/Netflix/metaflow/blob/master/metaflow/tut...) and say, add some sort of input pre-processing before the compute_statistics call, I'd essentially end up having to either override what compute_statistics does to not match its name _or_ copy-paste that first step just to replace that last line.

I'm sure this design decision was considered and/or that use-case doesn't come up a lot at Netflix (although I've encountered that a lot), or maybe I'm missing something very obvious, but I'd love to hear your thoughts on that.
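To illustrate the alternative being described (children declare their parents, so parents know nothing about what comes next), here is a hypothetical sketch. It is neither Metaflow's API nor my internal tool: `depends_on`, `Pipeline` and the toy `MovieStats` classes are invented for this example, and `graphlib` needs Python 3.9+.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def depends_on(*parents):
    """Mark a method as a step and record the steps it depends on."""
    def mark(func):
        func._parents = parents
        return func
    return mark


class Pipeline:
    def run(self):
        # Collect every method marked as a step, with its declared parents.
        graph = {}
        for name in dir(self):
            attr = getattr(self, name)
            if callable(attr) and hasattr(attr, "_parents"):
                graph[name] = set(attr._parents)
        order = list(TopologicalSorter(graph).static_order())
        for name in order:
            getattr(self, name)()
        return order


class MovieStats(Pipeline):
    @depends_on()
    def start(self):
        self.rows = [3, 1, 2]

    @depends_on("start")
    def compute_statistics(self):
        self.total = sum(self.rows)


class CleanedMovieStats(MovieStats):
    @depends_on("start")
    def preprocess(self):
        self.rows = [r for r in self.rows if r > 1]

    @depends_on("preprocess")
    def compute_statistics(self):
        super().compute_statistics()  # reuse the logic, only rewire the edge
```

The subclass leaves `start` untouched and reuses the body of `compute_statistics` through `super()`; only the dependency declaration changes.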

We erred on the side of simplicity to keep things manageable for our users.

That seems more complicated.

I have been looking for something exactly like this, but I use GCP not AWS. Is there a way to deploy outside of AWS? What would be involved in getting it to work with a different cloud?

All of our integrations are driven by plugins. We are exploring our roadmap with regards to integrations with other clouds.

Does this do experiment tracking similar to MLflow? I’m trying to figure where these two overlap and where they diverge.

yep, Metaflow tracks everything: Your code, dependencies, and the internal state of the workflow automatically.

A big difference between Metaflow and other workflow frameworks for ML is that Metaflow doesn't only execute your DAG, it helps you to design and implement the code that runs inside the DAG. Many frameworks leave these details to the data scientist to decide.

The centralized DAG scheduler seems like a pretty important part. How much will not having Meson hamper the usability?

Hi bhtucker, We do have plans to integrate with production-grade schedulers in the very near future. https://github.com/Netflix/metaflow/issues/2

note that Metaflow comes with a built-in DAG scheduler, so it is perfectly usable without a centralized scheduler like Meson.

Integrations to Airflow or StepFunctions are on the roadmap, depending on what seems to resonate with people outside Netflix.

Thank you for sharing this. Would this be useful for me if I only need the deployment management part? Don't really need to track experiments, just looking for an easy way to deploy my models to Fargate.

Yes indeed. You can access your model via the Metaflow client inside your service - https://docs.metaflow.org/metaflow/client. We are looking into releasing the hosting component on Metaflow in the future - https://github.com/Netflix/metaflow/issues/3

G'day, seems like a cool tool, thanks - the links to the github tuts are currently broken...

I'm of the opinion that adopting some standard DAG meta format for data science may make a positive impact on the reproducibility issues we have in science generally. So it's good to see the idea has real world merit as well.

Hi ageofwant, Thanks for taking a look at Metaflow. We will fix these links. Can you point us to the offending links?

I have updated the links.

Pray you do not update them further?


This is pretty off-topic, and you're asking an infrastructure engineer who is not going to be able to answer this.

No, it is not offtopic in the slightest...especially in response to an "author". The underlying questions remain obvious. Is metaflow not used in netflix suggestions? What utility does it offer that other tooling doesnt or at least how has netflix extracted value? These are interesting questions.

I replied to someone complaining about their Netflix movie recommendations being bad. Did you think I was replying to you?

My team has a similar library called Loman, which we open-sourced. Instead of nodes representing tasks, they represent data, and the library keeps track of which nodes are up-to-date or stale as you provide new inputs or change how nodes are computed. Each node is either an input node with a provided value, or a computed node with a function to calculate its value. Think of it as a grown-up Excel calculation tree. We've found it quite useful for quant research, and in production it works nicely because you can serialize entire computation graph which gives an easy way to diagnose what failed and why in hundreds of interdependent computations. It's also useful for real-time displays, where you can bind market and UI inputs to nodes and calculated nodes back to the UI - some things you want to recalculate frequently, whereas some are slow and need to happen infrequently in the background.

[1] Github: https://github.com/janushendersonassetallocation/loman

[2] Docs: https://loman.readthedocs.io/en/latest/

[3] Examples: https://github.com/janushendersonassetallocation/loman/tree/...
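For a feel of the "nodes are data, and the graph tracks what is stale" idea, here is a tiny stdlib sketch. It is deliberately not Loman's API (see the docs above for that); the `Graph` class and its method names are made up for illustration.

```python
class Graph:
    """Minimal data-node graph that recomputes only stale nodes on demand."""

    def __init__(self):
        self.values = {}    # node name -> current value
        self.funcs = {}     # computed node -> (function, input node names)
        self.stale = set()  # computed nodes that need recalculation

    def add_input(self, name, value):
        self.values[name] = value
        self._mark_dependents_stale(name)

    def add_node(self, name, func, inputs):
        self.funcs[name] = (func, tuple(inputs))
        self.stale.add(name)
        self._mark_dependents_stale(name)

    def _mark_dependents_stale(self, changed):
        # Propagate staleness to everything downstream of the changed node.
        for name, (_, inputs) in self.funcs.items():
            if changed in inputs and name not in self.stale:
                self.stale.add(name)
                self._mark_dependents_stale(name)

    def get(self, name):
        if name in self.stale:
            func, inputs = self.funcs[name]
            self.values[name] = func(*(self.get(i) for i in inputs))
            self.stale.discard(name)
        return self.values[name]
```

Changing an input marks everything downstream as stale, and `get` recomputes only what is needed for the node you ask for.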

Is there a reason to use this over DVC[1] which is language and framework agnostic and supports a large number of storage backends? It works with any git repo and even polyglot implementations and can run the DAG on any system.

Currently using DVC, MLflow just for metadata visualization and notes on experiments, and Anaconda for (python) dependency management. We are an embedded shop so we don't deploy to the "cloud."

[1]: https://dvc.org/

We are good friends with the DVC folks! If the DVC + MLFlow + Anaconda stack works for you, that's great. Metaflow provides similar features. The cloud integration is really important at Netflix's scale.

I am disappointed that when I click on documentation, "why metaflow," I get a bunch of cartoony BS instead of a simple text explanation. Glad these folks don't write RFCs.

Edit: just went to the Amazon CodeGuru homepage. Fantastic! Wish they were all like that.

Can anybody provide a good comparison e.g. with Meltano?

I am not affiliated with the Meltano people, but I like the idea of keeping the system modular, which seems to make it easier to replace components.

I have no doubt that we will see better replacements for every component of a data pipeline in the coming years. If there is only one thing to do right, then it's to not bet on one tool but keep the whole stack flexible.

I am still missing well established standards for data formats, workflow definitions and project descriptions - hopefully open source ninjas will deliver on this front before proprietary pirates destroy the field with progress-inhibiting closed things. It seems to be too late to create an "Autocad" or "Word" file format for data science, and I see no clear winner atm, but hopefully my sight is bad - please enlighten me!

I am not familiar with Meltano, sorry.

btw, if you happen to be at AWS Reinvent right now, you can get a stylish, collector's edition Metaflow t-shirt if you drop by at the Netflix booth at the expo hall and/or ping us otherwise!

Seems like a cool addition to the DAG ML tooling family. Thanks for sharing! Do you support, or plan to support, features commonly found in data science platform tools like Domino (https://www.dominodatalab.com/)? I'm thinking of container management, automatic publishing of web apps and API endpoints, providing a search for artifacts like code or projects, etc.

Good question. We have many common features covered:

- Container management: See https://docs.metaflow.org/metaflow/dependencies

- Search for artifacts: see https://docs.metaflow.org/metaflow/client

- Automatic publishing of web apps: we have this internally but it is not open-source yet. If it interests you, react to this issue (https://github.com/Netflix/metaflow/issues/3)

Let us know if you notice any other interesting features missing! Feel free to open a GitHub issue or reach out to us on http://chat.metaflow.org

The link in the docs to the CloudFormation template source is broken: https://docs.metaflow.org/metaflow-on-aws/deploy-to-aws#clou... Instead of /Netflix/metaflow-tools/aws it should probably be /Netflix/metaflow-tools/tree/master/aws

Thanks for reporting it. We'll fix it. Sorry for the inconvenience.

No worries, on the whole the documentation is top notch.


Wow, wasn't expecting that quick of a turnaround.

i looked over the tutorials and curious to know, whether the tutorials are representative of how Netflix does ML ?

is data really being read in .csv format and processed in memory with pandas ?

because I see "petabytes of data" being thrown around everywhere, and i am just trying to understand how one can read gigabytes of .csv and do simple stats like group-bys in pandas - shouldn't a simple SQL DWH do the same thing more efficiently with partitioned tables, clustered indexes and the power of the SQL language ?

i would love to take a look at one representative ML pipeline (even with masked names of datasets, features) just to see how "terabytes" of data get processed into a model

Good question! A typical Metaflow workflow at Netflix starts by reading data from our data warehouse, either by executing a (Spark)SQL query or by fetching Parquet files directly from S3 using the built-in S3 client. We have some additional Python tooling to make this easy (see https://github.com/Netflix/metaflow/issues/4)

After the data is loaded, there are a bunch of steps related to data transformations. Training happens with an off-the-shelf ML library like Scikit Learn or Tensorflow. Many workflows train a suite of models using the foreach construct.

The results can be pushed to various other systems. Typically they are either pushed to another table or as a microservice (see https://github.com/Netflix/metaflow/issues/3)
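The "train a suite of models" pattern is a fan-out over parameters followed by a join that gathers the results. The sketch below shows that shape with the standard library only; it does not use Metaflow's foreach, and the toy threshold "model" and both function names are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_threshold_model(threshold, data):
    """Toy 'training': score a fixed-threshold classifier by its accuracy."""
    correct = sum((x >= threshold) == label for x, label in data)
    return {"threshold": threshold, "accuracy": correct / len(data)}


def train_suite(thresholds, data):
    # Fan out: one independent task per parameter value.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda t: fit_threshold_model(t, data), thresholds))
    # Join: gather every branch's result and keep the best one.
    return max(results, key=lambda r: r["accuracy"])
```

In a real workflow each branch would train an actual model, possibly on its own container, and the join step would compare or ensemble them.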

This looks like a fantastically clean API for Python data and ML pipelines. Congratulations!

It would be great to have a scheduler and monitoring UI that are equally lightweight.

Metaflow comes with a built in scheduler. If your company has an existing scheduler, e.g. for ETL, you can translate Metaflow DAGs to the production scheduler automatically. This is what we do at Netflix.

We could provide similar support e.g. for Airflow, if there's interest.

For monitoring, we have relied on notebooks this far. Their flexibility and customizability is unbeatable. We might build some lightweight discovery / debugging UI later, if there's demand. We are a bit on the fence about it internally.

> For monitoring, we have relied on notebooks this far.

This is a pretty interesting approach. Notebooks are a natural environment for data scientists, and for monitoring they'd provide really good discoverability, flexibility, and interactivity.

Do you use more traditional monitoring that will alert you somehow if a workflow fails, or a workflow hasn't run at all for X hrs/days?

At Netflix, we rely on the workflow scheduler for such alerting and bundle in a layer of triggering mechanism (custom notifications and such)

> Metaflow comes with a built in scheduler.

This isn't a scheduler that will trigger 'flows' though is it? If I search "Scheduler" in your docs the top result is a roadmap item, and searching "Cron" turns up nothing.

Is it true that right now that to run a DAG "every day at 3am UTC" requires an external service?

Right now, yes.

Thanks for confirmation.

It's so simple and intuitive to run two steps in parallel. Thank you, Netflix!

You’re welcome! :)

How does this compare to snakemake[1] and nextflow[2]?

[1] https://snakemake.readthedocs.io/en/stable/ [2] https://www.nextflow.io/

The fact that metaflow works directly in Python piques my interest. I can lint it, I can test it, I can format it, I can easily extend it.

I've been hesitant to commit myself and my collaborators to yet another DSL -- and that's part of why I haven't seen much to offer in snakemake and nextflow.

Yes - that’s our thinking too. Compilers finding your typos for variable names seems helpful for user productivity.

I am not familiar with them.

Very interesting project. I love that this allows you to transparently switch "runtime" from local to cloud, like spark does, but integrated with common python tools like sklearn/tf etc. Looking forward to testing metaflow out myself.

Thanks. Let us know how you like the prototyping -> scaling out & up journey.

At first glance, I see BASIC's GOTO statement.

How does it compare to dagster.io?


Orchestrating a workflow, which is what Dagster does, is just one part of Metaflow. Other important parts are dependency management, cloud integration, state transfer, inspecting and organizing results - features that are central to data science workflows.

Metaflow helps data scientists build and manage data science workflows, not just execute a DAG.

this seems like it's very similar to http://metaflow.fr is there any relation, or is this a name collision?

no relation

We are on Azure using Spark via Databricks. We had to abandon sci kit learn because of this choice. Does your service require AWS and can it be used in conjunction with Spark? Thank you for your time and consideration.

We currently provide integrations with AWS (S3 and Batch) and it is easy to extend Metaflow to work with other cloud providers. https://docs.metaflow.org/internals-of-metaflow/technical-ov...

What's the reason you need to abandon scikit-learn? You can run scikit-learn on Databricks, and many of our customers do.

Disclaimer: Databricks cofounder.

What about databricks made you abandon sklearn?
