Here's my understanding:
- It's a Python library for creating and executing DAGs
- Each node is a processing step, and results are stored after each step so you can restart failed workflows from where they failed (a minimal sketch below)
- Tight integration with AWS ECS to run the whole DAG in the cloud
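From what I can tell, a flow looks roughly like this (a minimal sketch I put together; the flow and step names are made up):

```python
from metaflow import FlowSpec, step

class HelloFlow(FlowSpec):
    """A toy two-step DAG; anything assigned to self is persisted after each step."""

    @step
    def start(self):
        self.numbers = [1, 2, 3]          # stored as an artifact after this step
        self.next(self.total)

    @step
    def total(self):
        self.result = sum(self.numbers)   # a failed later step can resume from here
        self.next(self.end)

    @step
    def end(self):
        print("total:", self.result)

if __name__ == "__main__":
    HelloFlow()
```

Run locally with `python hello_flow.py run` (filename is mine).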
I don't know why, but their .org site oddly feels like a paid SaaS tool. Anyway, thank you Netflix for open-sourcing Metaflow.
It is also very opinionated about dependency management (Conda-only) and is Python-only, whereas Airflow, I think, has operators to run arbitrary containers. So I think Metaflow is a non-starter if you don't want to use Python exclusively.
Airflow also ships with built-in scheduler support (Celery?) or can run on K8s. Metaflow doesn't have this; it seems to rely on AWS Batch for production DAG execution.
Airflow ships with a pretty rich UI. Metaflow seems to be anti-UI, and provides a novel notebook-oriented workflow interaction model.
Metaflow has pretty nice code artifact + params snapshotting functionality, which is a core selling point. Airflow doesn't support this as well, so reproducibility is harder (I think). This is encapsulated by their "Datastore" model, which can persist flow code, config, and data locally or in S3.
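e.g. afterwards you can pull a past run's artifacts back out from a notebook, roughly like this (a sketch based on their client docs; the flow and artifact names are made up):

```python
from metaflow import Flow

# most recent successful run of a (hypothetical) flow called MovieStatsFlow
run = Flow("MovieStatsFlow").latest_successful_run
print(run.id)

# anything a step assigned to self is available as run.data.<name>
stats = run.data.genre_stats  # hypothetical artifact name
```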
I wouldn’t qualify metaflow as anti-UI. For model monitoring, we haven’t found a good enough UI that can handle the diversity of models and use cases we see internally, and we believe that notebooks are an excellent visualisation medium that gives the power to the end users (data scientists) to craft dashboards as they see fit. For tracking the execution of production runs, we have historically relied on the UI of the scheduler itself (Meson). We are exploring what a Metaflow-specific UI might look like.
As for comparisons with Airflow, it is an excellent production grade scheduler. Metaflow intends to solve a different problem of providing an excellent development and deployment experience for ML pipelines.
> Our execution model also supports arbitrary docker containers (on AWS batch) where you can theoretically bake in your own dependencies.
That's fair, but it doesn't seem to be something encouraged by the framework, and that's fine.
> I wouldn’t qualify metaflow as anti-UI.
Maybe anti-UI is too strong, yeah. I personally think your approach could be great. Looking forward to exploring it.
For instance, this tutorial example here (https://github.com/Netflix/metaflow/blob/master/metaflow/tut...) does not look substantially different to what I could achieve just as easily in R, or other Python data wrangling frameworks.
Is the main feature the fact I can quickly put my workflows into the cloud?
- Metaflow snapshots your code, data, and dependencies automatically in a content-addressed datastore, which is typically backed by S3, although local filesystem is supported too. This allows you to resume workflows, reproduce past results, and inspect anything about the workflow e.g. in a notebook. This is a core feature of Metaflow.
- Metaflow is designed to work well with a cloud backend. We support AWS today, but technically other clouds could be supported too. There's quite a bit of engineering that has gone into building this integration. For instance, using Metaflow's built-in S3 client you can pull over 10Gbps, which is more than you can easily get with e.g. the aws CLI today (a small sketch after this list).
- We have spent time and effort in keeping the API surface area clean and highly usable. YMMV, but it has been an appealing feature to many users thus far.
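To give a flavor, here is a minimal sketch of the S3 client mentioned above (the bucket, prefix, and keys are placeholders):

```python
from metaflow import S3

# Fetch many objects in parallel; get_many downloads them to local temp files.
with S3(s3root="s3://my-bucket/my-prefix/") as s3:
    for obj in s3.get_many(["part-0000.csv", "part-0001.csv"]):
        print(obj.key, obj.path)  # obj.path is the downloaded local file
```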
Hope this makes sense!
I am not familiar with Prefect.
It is conceivable to execute a Flow entirely locally yet achieve @step-wise "Batch-like" parallelism/distributed computation by doing `import dask` in the relevant @steps and using it as you would outside of Metaflow, correct?
Although, as I think of it, the `parallel_map` function would achieve much of what Dask offers on a single box, wouldn't it? But within a @step, using dask-distributed could kinda replicate something a little more akin to the AWS integration?
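Roughly what I have in mind, just as an untested sketch (the Dask scheduler address is a placeholder):

```python
from metaflow import FlowSpec, step, parallel_map

class LocalParallelFlow(FlowSpec):

    @step
    def start(self):
        # single-box parallelism with Metaflow's own helper
        self.squares = parallel_map(lambda x: x * x, range(100))
        self.next(self.aggregate)

    @step
    def aggregate(self):
        # or hand the heavy lifting to an existing on-prem dask-distributed cluster
        from dask.distributed import Client
        client = Client("tcp://scheduler-address:8786")  # placeholder address
        self.total = client.submit(sum, self.squares).result()
        self.next(self.end)

    @step
    def end(self):
        print(self.total)

if __name__ == "__main__":
    LocalParallelFlow()
```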
Tangentially related, but the docs note that data checkpointing is achieved using pickle. I've never compared them, but I've found parquet files to be extremely performant for pandas dataframes. Again, I'm assuming a lot here, but I'd expect @step results to be dataframes quite often. What was the design consideration associated with how certain kinds of objects get checkpointed?
To be clear, the fundamental motivation behind these lines of questioning is: how can I leverage Metaflow in conjunction with existing python-based _on-prem_ distributed (or parallel) computing utilities, e.g. Dask? In other words, can I expect to leverage Metaflow to execute batch ETL or model-building jobs that require distributed compute that isn't owned by $cloud_provider?
As an aside: hot damn, the API - https://i.kym-cdn.com/photos/images/newsfeed/000/591/928/94f... Really cool product. Thanks for all the effort poured into this by you and others.
The first part of your understanding matches my expectation too.
Dask's single-box parallelism is achieved via multiprocessing, akin to parallel_map.
And distributed compute is achieved by shipping the work to remote substrates.
For your second comment: we leverage pickle mostly to keep things easy and simple for basic types.
For small dataframes we just pickle them for simplicity. For larger dataframes we rely on users to store the data directly (probably encoded as parquet) and just pickle the path instead of the whole dataframe.
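Something along these lines, just as a sketch (the path is a placeholder; writing parquet assumes pyarrow, plus s3fs for S3 paths):

```python
from metaflow import FlowSpec, step

class ScoresFlow(FlowSpec):

    @step
    def start(self):
        import pandas as pd
        df = pd.DataFrame({"user": [1, 2, 3], "score": [0.3, 0.7, 0.5]})

        # small frame: assign to self and let Metaflow pickle it
        self.small_df = df

        # large frame: write parquet yourself and keep only the path as the artifact
        path = "s3://my-bucket/flows/scores.parquet"  # placeholder location
        df.to_parquet(path)
        self.big_df_path = path
        self.next(self.end)

    @step
    def end(self):
        print(self.big_df_path)

if __name__ == "__main__":
    ScoresFlow()
```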
So Metaflow is the local dev version of the workflow construct, and then you export that to an Airflow/etc.-compatible format?
What workflow engine do you primarily use and support in Metaflow?
At Netflix, we use an internal workflow engine called Meson https://www.youtube.com/watch?v=0R58_tx7azY
Could you elaborate, or point me at any reviews of their product? It's closed-source, so it's much harder to learn about without hearing from people who have paid for it.
Why not use Secrets Manager for this? It can even rotate the secret without much headache.
/edit: I could have a wrapper script that reads the secret and then os.execve()...
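Something like this is what I mean, as an untested sketch (the secret name is made up, and I'm assuming the secret is a JSON map of env var names to string values):

```python
import json
import os
import sys

import boto3

def main():
    # read the secret from AWS Secrets Manager
    sm = boto3.client("secretsmanager")
    secret = json.loads(
        sm.get_secret_value(SecretId="my-app/metaflow")["SecretString"]  # made-up name
    )

    # inject it into the environment and exec the real command in place,
    # e.g. `python wrapper.py flow.py run`
    env = dict(os.environ, **secret)
    os.execve(sys.executable, [sys.executable] + sys.argv[1:], env)

if __name__ == "__main__":
    main()
```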
Can you please explain how you were able to better the performance of the aws CLI?
Our S3 client just handles multiple worker processes correctly with error handling.
1. At what sort of scale does Metaflow become useful? Would you expect Metaflow to augment the productivity of a lone data scientist working by himself? Or is it more likely that you would need 3, 10, 25, or more data scientists before Metaflow is likely to become useful?
2. When you move to a new text editor, there are some initial frictions while you're trying to wrap your head around how things work. So, it can take some time before you become productive. Analogously, I imagine there are initial frictions when moving to Metaflow. In your experience, after Metaflow's environment has already been established, how long does it take for data scientists to get back to their initial productivity? It would be useful to have a sense of this for the data scientist who would want to sell their organization on adopting Metaflow.
3. Many data scientists work in organizations which have far less mature data infrastructure than Netflix, and/or data science needs of a much smaller scale than Netflix. In particular, I may not even have batch processing needs (e.g. a social scientist working on datasets which can be held entirely in memory). In that case, is Metaflow useful?
4. What's the closest open-source alternative to Metaflow on the market? Off the top of my head, I can't think of anything which quite matches.
2. Keeping the language Pythonic, without any additional need to learn a DSL, has definitely been key to Metaflow's adoption internally. That said, this is something we are open to hearing feedback on, especially with this OSS launch.
3. Yes - definitely think so. Personally, my favorite part is the local prototyping experience, when everything can fit in memory and is blazing fast.
There is also an open issue for fast data access, which you can upvote if you're interested in seeing it open-sourced.
4. We don't think there is an exact equivalent either. :)
If you don't consider them basically equivalent, what would you say are the key differences?
re: Kubeflow - imho it is quite coupled to Kubernetes. We don’t intend to be tied to a specific compute substrate even though the first launch is with AWS. We do follow a plugin architecture, so I’m hoping Kubernetes support happens sometime.
re: Flyte - I’m less informed on this but happy to educate myself and get back.
If you feel inclined, jump into the Flyte Slack and share your thoughts :). At my company we're on Kubeflow/Argo now, but things are developing quite a lot in this space, so I'm keen not to be myopic.
Thanks in advance,
Can you say a little about which niche this would occupy, and what the motivation is? Is it intended to compete with TensorFlow and PyTorch, or to be an industrial-strength version of scikit-learn?
I looked through the tutorial on my mobile and the answer was not immediately clear.
Is the benefit that it auto-scales on AWS without having to think through the infrastructure?
Metaflow doesn't do that -- instead each 'step' has to know what to call next, which means that if I wanted to subclass e.g. the MovieStatsFlow in [here](https://github.com/Netflix/metaflow/blob/master/metaflow/tut...) and, say, add some sort of input pre-processing before the compute_statistics call, I'd essentially end up having to either override what compute_statistics does to not match its name _or_ copy-paste that first step just to replace that last line.
I'm sure this design decision was considered and/or that use case doesn't come up a lot at Netflix (although I've encountered it a lot), or maybe I'm missing something very obvious, but I'd love to hear your thoughts on that.
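To illustrate what I mean, a toy sketch (not the actual tutorial code):

```python
from metaflow import FlowSpec, step

class MovieStatsFlow(FlowSpec):

    @step
    def start(self):
        self.ratings = [3, 4, 5]
        # the successor is hard-coded inside the step itself:
        self.next(self.compute_statistics)

    @step
    def compute_statistics(self):
        self.mean_rating = sum(self.ratings) / len(self.ratings)
        self.next(self.end)

    @step
    def end(self):
        print(self.mean_rating)

# To slot a preprocessing step between start and compute_statistics, a subclass
# would have to override start() entirely just to change its self.next(...) call,
# or copy-paste that step and rename things.

if __name__ == "__main__":
    MovieStatsFlow()
```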
A big difference between Metaflow and other workflow frameworks for ML is that Metaflow doesn't only execute your DAG, it helps you to design and implement the code that runs inside the DAG. Many frameworks leave these details to the data scientist to decide.
Integrations with Airflow or Step Functions are on the roadmap, depending on what seems to resonate with people outside Netflix.
I'm of the opinion that adopting some standard DAG meta-format for data science may make a positive impact on the reproducibility issues we have in science generally. So it's good to see the idea has real-world merit as well.
 Github: https://github.com/janushendersonassetallocation/loman
 Docs: https://loman.readthedocs.io/en/latest/
 Examples: https://github.com/janushendersonassetallocation/loman/tree/...
Currently using DVC, MLflow just for metadata visualization and notes on experiments, and Anaconda for (python) dependency management. We are an embedded shop so we don't deploy to the "cloud."
Edit: just went to the Amazon CodeGuru homepage. Fantastic! Wish they were all like that.
I am not affiliated with the Meltano people, but I like the idea of keeping the system modular, which seems to make it easier to replace components.
I have no doubt that we will see better replacements for every component of a data pipeline in the coming years. If there is only one thing to do right, it's to not bet on one tool but to keep the whole stack flexible.
I am still missing well-established standards for data formats, workflow definitions, and project descriptions - hopefully open-source ninjas will deliver on this front before proprietary pirates destroy the field with progress-inhibiting closed things. It seems to be too late to create an "Autocad" or "Word" file format for data science, and I see no clear winner at the moment - but hopefully my sight is bad, please enlighten me!
- Container management: see https://docs.metaflow.org/metaflow/dependencies (a small sketch after this list)
- Search for artifacts: see https://docs.metaflow.org/metaflow/client
- Automatic publishing of web apps: we have this internally but it is not open-source yet. If it interests you, react to this issue (https://github.com/Netflix/metaflow/issues/3)
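For example, step-level dependencies can be pinned with the conda decorators roughly like this (the versions here are just placeholders):

```python
from metaflow import FlowSpec, step, conda, conda_base

@conda_base(python="3.7.4")                 # base environment for the whole flow
class DependencyFlow(FlowSpec):

    @conda(libraries={"pandas": "0.25.3"})  # extra libraries for this step only
    @step
    def start(self):
        import pandas as pd
        self.pandas_version = pd.__version__
        self.next(self.end)

    @step
    def end(self):
        print("pandas", self.pandas_version)

if __name__ == "__main__":
    DependencyFlow()
```

Run with `--environment=conda` to activate the isolated environments.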
Let us know if you notice any other interesting features missing! Feel free to open a GitHub issue or reach out to us on http://chat.metaflow.org
Is data really being read in .csv format and processed in memory with pandas?
Because I see "petabytes of data" being thrown around everywhere, and I am just trying to understand how one can read gigabytes of .csv and do simple stats like group-bys in pandas. Shouldn't a simple SQL DWH do the same thing more efficiently with partitioned tables, clustered indexes, and the power of the SQL language?
I would love to take a look at one representative ML pipeline (even with masked names of datasets and features) just to see how "terabytes" of data get processed into a model.
After the data is loaded, there are a bunch of steps related to data transformations. Training happens with an off-the-shelf ML library like scikit-learn or TensorFlow. Many workflows train a suite of models using the foreach construct (a sketch below).
The results can be pushed to various other systems. Typically they are either pushed to another table or served as a microservice (see https://github.com/Netflix/metaflow/issues/3).
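As a sketch, the foreach fan-out mentioned above looks roughly like this (the parameter grid and "training" are stand-ins for the real thing):

```python
from metaflow import FlowSpec, step

class TrainSuiteFlow(FlowSpec):

    @step
    def start(self):
        self.alphas = [0.01, 0.1, 1.0]
        # fan out: one train task per value in self.alphas
        self.next(self.train, foreach="alphas")

    @step
    def train(self):
        alpha = self.input                  # the value assigned to this branch
        self.score = 1.0 / (1.0 + alpha)    # stand-in for real model training
        self.next(self.join)

    @step
    def join(self, inputs):
        self.scores = [inp.score for inp in inputs]  # gather results from all branches
        self.next(self.end)

    @step
    def end(self):
        print(self.scores)

if __name__ == "__main__":
    TrainSuiteFlow()
```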
It would be great to have a scheduler and monitoring UI that are equally lightweight.
We could provide similar support e.g. for Airflow, if there's interest.
For monitoring, we have relied on notebooks thus far. Their flexibility and customizability are unbeatable. We might build some lightweight discovery / debugging UI later, if there's demand. We are a bit on the fence about it internally.
This is a pretty interesting approach. Notebooks are a natural environment for data scientists, and for monitoring they'd provide really good discoverability, flexibility, and interactivity.
Do you use more traditional monitoring that will alert you somehow if a workflow fails, or a workflow hasn't run at all for X hrs/days?
This isn't a scheduler that will trigger 'flows' though, is it? If I search "Scheduler" in your docs, the top result is a roadmap item, and searching "Cron" turns up nothing.
Is it true that right now that to run a DAG "every day at 3am UTC" requires an external service?
I've been hesitant to commit myself and my collaborators to yet another DSL -- and that's part of why I haven't found much to like in snakemake and nextflow.
Metaflow helps data scientists build and manage data science workflows, not just execute a DAG.
Disclaimer: Databricks cofounder.