Launch HN: DAGWorks – ML platform for data science teams
182 points by krawczstef on March 7, 2023 | 65 comments
Hey HN! We’re Stefan and Elijah, co-founders of DAGWorks (https://www.dagworks.io). We’re on a mission to eliminate the insane inefficiency of building and maintaining ML pipelines in production.

DAGWorks is based on Hamilton, an open-source project that we created and recently forked (https://github.com/dagworks-inc/hamilton). Hamilton is a set of high-level conventions for Python functions that can be automatically converted into working ETL pipelines. To that, we're adding a closed-source offering that goes a step further, plugging these functions into a wide array of production ML stacks.

ML pipelines consist of computational steps (code + data) that produce a working statistical model that a business can use. A typical pipeline might be (1) pull raw data (Extract), (2) transform that data into inputs for the model (Transform), (3) define a statistical model (Transform), (4) use that statistical model to predict on another data set (Transform) and (5) push that data for downstream use (Load). Instead of “pipeline” you might hear people call this “workflow”, “ETL” (Extract-Transform-Load), and so on.

Maintaining these in production is insanely inefficient because you need both data scientists and software engineers to do it. Data scientists know the models and data, but most can't write the code needed to get things working in production infrastructure—for example, a lot of mid-size companies out there use Snowflake to store data, Pandas/Spark to transform it, and something like Databricks' MLflow to handle model serving. Engineers can handle the latter, but mostly aren't experts in the ML stuff. It's a classic impedance mismatch, with all the horror stories you'd expect—e.g. when data scientists make a change, engineers (or data scientists who aren’t engineers) have to manually propagate the change in production. We've talked to teams who are spending as much as 50% of their time doing this. That's not just expensive, it's gruntwork—those engineers should be working on something else! Basically, maintaining ML pipelines over time sucks for most teams.

One way out is to hire people who combine both skills, i.e. data scientists who can also write production code. But these are rare and expensive, and in our experience they usually are only expert at one side of the equation and not as good at the other.

The other way is to build your own platform to automatically integrate models + data into your production stack. That way the data scientists can maintain their own work without needing to hand things off to engineers. However, most companies can't afford to make this investment, and even for the ones that can, such in-house layers tend to end up in spaghetti code and tech debt hell, because they're not the company's core product.

Elijah and I have been building data and ML tooling for the last 7 years, most recently at Stitch Fix, where we built an ML platform that served over 100 data scientists from various modeling disciplines (some of our blog posts, like [1], hit the front page of HN - thanks!). We saw firsthand the issues teams encountered with ML pipelines.

Most companies running ML in production need a ratio of 1:1 or 2:1 data scientists to engineers. At bigger companies like Stitch Fix, the ratio is more like 10:1—way more efficient—because they can afford to build the kind of platform described above. With DAGWorks, we want to bring the power of an intuitive ML Pipeline platform to all data science teams, so a ratio of 1:1 is no longer required. A junior data scientist should be able to easily and safely write production code without deep knowledge of underlying infrastructure.

We decided to build our startup around Hamilton, in large part due to the reception that it got here [2] - thanks HN! We came up with Hamilton while we were at Stitch Fix (note: if you start an open-source project at an employer, we recommend forking it right away when you start a company. We only just did that and left behind ~900 stars...). We are betting on it being our abstraction layer to enable our vision of how to build and maintain ML pipelines, given what we learned at Stitch Fix. We believe a solution has to have an open-source component to be successful (we invite you to check out the code). Why the name DAGWorks? We named the company after directed acyclic graphs because we think the DAG representation, which Hamilton also provides, is key.

A quick primer on Hamilton. With Hamilton we use a new paradigm in Python (well not quite “new” as pytest fixtures use this approach) for defining model pipelines. Users write declarative functions instead of writing procedural code. For example, rather than writing the following pandas code:

  df['col_c'] = df['col_a'] + df['col_b']
You would write:

  def col_c(col_a: pd.Series, col_b: pd.Series) -> pd.Series:
       """Creating column c from summing column a and column b."""
       return col_a + col_b
Then if you wanted to create a new column that used `col_c` you would write:

  def col_d(col_c: pd.Series) -> pd.Series:
       # logic, e.g.:
       return col_c * 2
These functions then define a "dataflow" or a directed acyclic graph (DAG), i.e. we can create a “graph” with nodes: col_a, col_b, col_c, and col_d, and connect them with edges to know the order in which to call the functions to compute any result. Since you’re forced to write functions, everything becomes unit testable and documentation friendly, with the ability to display lineage. You can kind of think of Hamilton as "DBT for python functions", if you know what DBT is. Have we piqued your interest? Want to go play with Hamilton? We created https://www.tryhamilton.dev/ leveraging pyodide (note it can take a while to load) so you can play around with the basics without leaving your browser - it even works on mobile!
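To make that wiring concrete, here's a minimal, hypothetical sketch (plain Python, not Hamilton's actual internals) of how function and parameter names alone are enough to build and execute such a DAG:

```python
# Sketch: each function is a node, and its parameter names point at the
# nodes it depends on. Scalars stand in for pd.Series to stay self-contained.
import inspect

def col_c(col_a, col_b):
    """Column c = column a + column b."""
    return col_a + col_b

def col_d(col_c):
    """Column d derived from column c."""
    return col_c * 2

def execute(funcs, inputs, target):
    """Recursively resolve `target` by matching parameter names to nodes."""
    if target in inputs:
        return inputs[target]
    fn = funcs[target]
    args = {p: execute(funcs, inputs, p) for p in inspect.signature(fn).parameters}
    return fn(**args)

funcs = {f.__name__: f for f in (col_c, col_d)}
print(execute(funcs, {"col_a": 1, "col_b": 2}, "col_d"))  # 6
```

The real framework does much more (type checking, visualization, delegation), but this is the core trick: the graph falls out of the signatures.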

What we think is cool about Hamilton is that you don’t need an explicit pipeline declaration step, because it’s all encoded in the function and parameter names! Moreover, everything is encapsulated in functions. So from a framework perspective, if we wanted to (for example) log timing information, introspect inputs/outputs, or delegate a function to Dask or Ray, we can inject that at the framework level, without having to pollute user code. Additionally, we can expose "decorators" (e.g. @tag(...)) that specify extra metadata to annotate the DAG with, or for use at run time. This is where our DAGWorks Platform fits in, providing off-the-shelf closed-source extras in this way.
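As a toy illustration of that decorator mechanism (a sketch of the general pattern, not DAGWorks' or Hamilton's actual implementation):

```python
# Hypothetical sketch: a decorator attaches metadata to a function without
# touching its body, so a framework can read it off the DAG node later.
def tag(**metadata):
    def wrapper(fn):
        fn._tags = metadata  # framework-visible annotations
        return fn
    return wrapper

@tag(owner="data-science", pii="false")
def col_c(col_a, col_b):
    return col_a + col_b

print(col_c._tags)   # {'owner': 'data-science', 'pii': 'false'}
print(col_c(1, 2))   # 3
```

The function's logic stays pure; the metadata rides along for lineage, governance, or runtime routing.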

Now, for those of you thinking there’s a lot of competition in this space, or that what we’re proposing sounds very similar to existing solutions, here are some thoughts to help distinguish Hamilton from other approaches/technologies:

(1) Hamilton's core design principle is helping people write more maintainable code; at a nuts-and-bolts level, Hamilton replaces the procedural code one would otherwise write.

(2) Hamilton runs anywhere that Python runs: a notebook, a Python script, within Airflow, within your Python web service, PySpark, etc. People use Hamilton for executing code in batch tasks and in online web services.

(3) Hamilton doesn't replace a macro orchestration system like Airflow, Prefect, Dagster, Metaflow, or ZenML; it runs within/uses them. Hamilton helps you model not only the micro (e.g. feature engineering) but also the macro (e.g. model pipelines). That said, given how big machines are these days, model pipelines can commonly run on a single machine - Hamilton is perfect for this.

(4) Hamilton doesn't replace things like Dask, Ray, or Spark - it can run on them, or delegate to them.

(5) Hamilton isn't just for building dataframes, though it’s quite good for that; you can model any Python object creation with it. Hamilton is data-type agnostic.

Our closed source offering is currently in private beta, but we'd love to include you in it (see next paragraph). Hamilton is free to use (BSD-3 license) and we’re investing in it heavily. We’re still working through pricing options for the closed source platform; we think we’ll follow the leads of others in the space like Weights & Biases, and Hex.tech here in how they price. For those interested, here’s a video walkthrough of Hamilton, which includes a teaser of what we’re building on the closed source side - https://www.loom.com/share/5d30a96b3261490d91713a18ab27d3b7.

Lastly, (1) we’d love feedback on Hamilton (https://github.com/dagworks-inc/hamilton) and on any of the above, and what we could do better. To stress the importance of your feedback: we’re going all-in on Hamilton. If Hamilton fails, DAGWorks fails. Given that Hamilton is a bit of a “Swiss Army knife” in what you can do with it, we need help prioritizing features. E.g. we just released experimental PySpark UDF map support - is that useful? Or perhaps you have streaming feature-engineering needs where we could add better support? Or you want a feature to auto-generate unit test stubs? Or maybe you are doing a lot of time-series forecasting and want more power features in Hamilton to help you manage inputs to your model? We’d love to hear from you! (2) For those interested in the closed source DAGWorks Platform, you can sign up for early access via www.dagworks.io (leave your email, or schedule a call with me) – we apologize for not having a self-serve way to onboard just yet. (3) If there’s something this post hasn’t answered, do ask and we’ll try to give you an answer! We look forward to any and all of your comments!

[1] https://news.ycombinator.com/item?id=29417998

[2] https://news.ycombinator.com/item?id=29158021




Data scientist here, stuck in the Dark Ages of "deploying" my models by writing bespoke Python apps that run on some kind of cloud container host like ECS. Dump the outputs to blob storage and slurp them back into the data warehouse nightly using Airflow. Lots of manual fussing around.

What the heck are all these ML and data platforms, how do they benefit me, and how do I evaluate the gazillion options that seem to be out there?

For example, I recently came across DStack (https://dstack.ai/) and have had an open browser tab sitting around waiting for me to figure out WTF it even does. DAGWorks seems like it does something similar. Is that true? Are these tools even comparable? How would I choose one or the other? Is there overlap with MLFlow?


TL;DR -- ML platforms solve for a million different problems, and most people don't have all (or any) of them. Hamilton is a pretty simple way of organizing code, so it can plug in with a bunch of different approaches (and that's why we think it's general-purpose).

> Data scientist here, stuck in the Dark Ages of "deploying" my models by writing bespoke Python apps that run on some kind of cloud container host like ECS. Dump the outputs to blob storage and slurp them back into the data warehouse nightly using Airflow. Lots of manual fussing around.

Oh, man, been there! So, first, I want to say there are a lot of ML/data platforms -- largely because there are so many problems to solve, and they're not one-size-fits-all solutions.

> What the heck are all these ML and data platforms, how do they benefit me, and how do I evaluate the gazillion options that seem to be out there?

You probably don't need all of them, or all that many. As in everything, it depends on your pain-points. Given that you have a lot of manual fussing around, you probably want something to reduce it. We've found airflow to be painful, but a lot of dev teams/DS have airflow already integrated into their platform, so we wanted to build something that allows data scientists to plug into it. So for people who don't like airflow, the idea is that you could express your dataflow in Hamilton and DAGWorks can ship it to airflow (or any other orchestration system).

> For example, I recently came across DStack (https://dstack.ai/) and have had an open browser tab sitting around waiting for me to figure out WTF it even does. DAGWorks seems like it does something similar. Is that true? Are these tools even comparable? How would I choose one or the other? Is there overlap with MLFlow?

DStack is definitely a different approach, similar space. Hamilton is organized around Python functions and dstack is more of a high-level workflow spec (reminds me of something we had at my old company). So Hamilton can model the "micro" of your workflow, whereas DStack models the "macro" -- managing artifacts. DStack could easily run Hamilton functions. Nothing else we've found out there (except perhaps Kedro) models the "micro" -- e.g. the specific fine-grained dependencies, so you can look at your code and figure out how exactly it works.

Re: MLFlow -- DAGWorks + MLFlow pair pretty naturally. Hamilton functions can produce a model that DAGWorks would be able to save to MLFlow. DAGWorks is more on the data-transform side, and doesn't prescribe how to represent a model.


>Nothing else we've found out there (except perhaps Kedro) models the "micro" -- e.g. the specific fine-grained dependencies, so you can look at your code and figure out how exactly it works.

Just wondering then @elijahbenizzy - how does Hamilton differ from Kedro?


Ha! Kedro maintainer just joined on the hamilton slack, and we were talking it over :) here’s what I was thinking:

TL;DR: Hamilton is lighter weight and less opinionated about non-pipeline stuff; it also has a different way of specifying pipelines (which we prefer).

I think there are a few key differences in the approach:

- Currently, Hamilton is lighter weight and less opinionated about directory structure/style guide. It’s just a library!

- Kedro pipelining (from what I understand) has you define the nodes separately from specifying their inputs, whereas in Hamilton it’s function-first, and the functions specify everything. It’s funny -- Kedro is actually very similar to the framework I first designed to solve this problem, which I compared with @Stefan Krawczyk’s idea (that became Hamilton); I called it “burr”.

- Kedro has a whole bunch of additional features that allow it to integrate with the outside world; Hamilton is lighter weight here (although we’ll likely be adding more).


Ah, cool, thanks for the clarification. Good luck with it all, and congrats on the launch!


Thank you!


Hey, the founder of dstack here. To put it shortly, dstack allows you to define ML workflows as code and run them either locally or remotely (e.g. in a configured cloud). ML workflows here mean anything that you may want to do when you're developing a model - prepping data, training or fine-tuning a model, etc. The value: it basically automates running your workflows, without being dependent on any particular vendor. At the same time, you don't have to rewrite your Python scripts to use a particular API (because dstack uses YAML).

We aim to build the easiest tool to run ML workflows - without making you use the UI of any vendor, or hassle with Kubernetes, custom Docker images, etc.

MLFlow doesn't do what dstack does (automatic infrastructure provisioning) - unless you use Databricks.


Nice! We had a similar abstraction at Stitch Fix.


> bespoke Python apps that run on some kind of cloud container host like ECS.

For this you could use, for example, https://flyte.org/, if you have platform engineers who can set it up for you. But I don't think their website does a good job of explaining what it is and what it does.


Yep, some of these ML platform tools only make sense once you reach a certain scale/set of problems. Flyte was created because Lyft had a very heterogeneous set of tools one could build an ML pipeline with. So they made it really easy to integrate, and importantly to serialize, data between systems. E.g. sql -> python -> spark -> python. But if all your data fits in memory, or you only use one system, Flyte might not be a great fit.

What I try to do is understand who created the platform and the environment of that company. That will give you a better idea of whether it makes sense for you or not.


I do agree it makes sense for scale, but if your data fits in memory, Flyte's native constructs shine. For example, it will ensure your data is stored/serialized correctly, and it allows you to use polars, vaex, duckdb, etc. Tbh I am a huge proponent of vertical scaling till you've gotten all the mileage you can.

It also supports GPU allocation, spot instances, and collaboration across multiple users. I do not think it is a wrong choice if you feel your complexity will grow.

PS. I am a maintainer of the project


Yep - it's on my TODO to get https://github.com/flyteorg/flyte/issues/2627 over the line!


Thank you for sharing. If you do not have platform engineers, look at union.ai. They offer a managed version of Flyte.


In my experience building the pipeline and related infrastructure is not trivial, but it’s also a relatively tiny problem compared to, well, everything else. That is, acquiring and moving data around, managing the data over the lifetime of a model’s use, and serving adjacent needs (e.g. post-deployment analytics). How does DAGWorks help with all the rest of this stuff?


To give you another take on Elijah's answer:

> acquiring and moving data around,

Yep, with Hamilton we provide the ability to cleanly separate the bits of logic that are required to change and update this. For example, you'd write "data loader" functions/modules that are implementations for, say, reading from a DB, a flat file, or some vendor. If they output a standardized data structure, then the rest of your workflow is coupled not to the implementation but to the common structure, which Hamilton forces you to define. That way you can be pretty surgical with changes and with understanding impacts.
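For instance, a sketch of that separation (the function names and schema here are made up for illustration, and these are plain functions rather than Hamilton's actual loader API):

```python
# Two interchangeable loaders that both emit the same standardized row
# structure, so downstream logic depends on the schema, not the source.
import csv
import io

def load_from_csv(raw: str) -> list:
    """Loader #1: parse a CSV export into the standardized structure."""
    rows = csv.DictReader(io.StringIO(raw))
    return [{"user_id": int(r["user_id"]), "spend": float(r["spend"])} for r in rows]

def load_from_api(payload: list) -> list:
    """Loader #2: reshape an API payload into the *same* structure."""
    return [{"user_id": d["id"], "spend": d["amount"]} for d in payload]

def total_spend(rows: list) -> float:
    """Downstream logic only assumes the common structure."""
    return sum(r["spend"] for r in rows)

csv_rows = load_from_csv("user_id,spend\n1,10.0\n2,5.5")
api_rows = load_from_api([{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}])
assert total_spend(csv_rows) == total_spend(api_rows) == 15.5
```

Swapping the vendor means swapping one loader function; nothing downstream changes.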

Regarding assessing impacts, Hamilton provides the ability to "visualize" and query for lineage as defined by your Hamilton functions. We think that with Hamilton we can make the "hey what does this impact?" question really easy to answer, so that when you do need to make changes you'll have more confidence in doing them.

> managing the data over the lifetime of a model’s use,

Hamilton isn't opinionated about where data is stored. But if you define the flow of computation with Hamilton and version it with a version control system like git, then all you additionally need to track is what configuration your Hamilton code was run with, and to associate those with the produced materialized data/artifact (i.e. git SHA + config + materialized artifact). That gives you a good base from which to ask and answer queries about what data was used when and where. Rather than bringing in 3rd-party systems, we think there's a lot you can leverage with Hamilton here.
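A toy sketch of that tracking triple (function and field names are ours, purely illustrative):

```python
# Record which code version (git SHA) and configuration produced which
# artifact, keyed by a content hash of the artifact itself.
import hashlib

def lineage_record(git_sha: str, config: dict, artifact: bytes) -> dict:
    """Associate code version + run configuration + produced artifact."""
    return {
        "git_sha": git_sha,
        "config": config,
        "artifact_sha256": hashlib.sha256(artifact).hexdigest(),
    }

record = lineage_record("abc1234", {"model": "xgboost", "train_window": "30d"},
                        b"<serialized model bytes>")
```

With records like this persisted per run, "which code and config produced this artifact?" becomes a lookup rather than an investigation.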

For example, we have users looking at Hamilton to help answer governance concerns with the models produced.

> and serving adjacent needs (e.g. post-deployment analytics).

If it's offline, then you can model and run that with Hamilton. The idea is to help provide integrations with whatever MLOps system here to make it easy to swap out.

For online, e.g. a web service, you could model the dataflow with Hamilton, and then build your own custom "compilation" to take Hamilton and project it onto a topology. During the projection, you could insert whatever monitoring concerns you'd want. So, just to say, this part isn't straightforward right now, but there is a path to addressing it.


Thanks for your question! Good points -- those are not easy problems! I'd be really curious about what the community thinks, but here are my opinions:

While Hamilton/DAGWorks is mainly for expressing the pipeline/abstracting away the infrastructure, DAGWorks can help make the model lifecycle easy as well:

- Hamilton pipelines can be run anywhere, including in an online setting

- Breaking into functions can make it modular/easy to annotate and gather post-hoc analysis

- Hamilton pipelines specify the data movement in code -- providing a source of truth

That said, I think the ecosystem for doing this is much cleaner/easier to manage than it used to be -- the MLOps stack is far more sophisticated. Scalable/reliable compute (spark, Modin), easy storage (snowflake, new feature store technology), and more model experiment-as-a-service type systems (mlflow, model-db, etc...) have made these less of a difficult problem than it was in the past. As these permeate the industry and we develop more standards, I think that it pushes the problem up a level -- rather than figuring out exactly how to solve these, the difficult part is looping together a bunch of systems that all do it fairly well but (a) require significant expertise to manage and (b) often result in code that's super coupled to the systems themselves (making it hard to test). DAGWorks wants to decouple these from the pipeline code, enabling you to choose which systems you delegate to and not have to worry about it.

Furthermore, we think that smaller pipelines are actually super underserved in the ML/data science community -- e.g. pipelines that don't have a lot of the "moving data around" problems but can be run on a single machine. I've seen these suffer from getting too complex/being difficult to manage, and we think Hamilton can solve this out of the box.

Thoughts?


Congrats for the launch Stefan and Elijah! :)

Like Stefan mentioned in the OP, Hamilton works well with tools like Metaflow, which can help with many of the other concerns you mentioned. How you define your data transformations for ML is an open question that Hamilton addresses neatly.

See here for an example of Metaflow+Hamilton in action: https://outerbounds.com/blog/developing-scalable-feature-eng...


Thanks! Yeah -- love the example we built with the metaflow team. I even think Savin spoke right before me at PyData NYC in the same room!


> With Hamilton we use a new paradigm in Python (well not quite “new” as pytest fixtures use this approach) for defining model pipelines. Users write declarative functions instead of writing procedural code. For example, rather than writing the following pandas code

> These functions then define a "dataflow" or a directed acyclic graph (DAG), i.e. we can create a “graph” with nodes: col_a, col_b, col_c, and col_d, and connect them with edges to know the order in which to call the functions to compute any result.

This 'new paradigm' already exists in Polars. Within the scope of a local machine, you can write declarative expressions which can then be used pretty much anywhere for querying instead of the usual arrays and series (arguments to filter/apply/groupby/agg/select etc), allowing it to build an execution graph for each query, optimise it and parallelise it, and try to only run through the data once if possible without cloning. Eg the example above can be written simply as

    col_c = (pl.col('a') + pl.col('b')).alias('c')

It is obviously restricted to what is supported in polars, but a surprising amount of the typical data munging can be done with incredible efficiency, both cpu and ram wise.


Yeah! So we actually have an integration with polars. See https://github.com/DAGWorks-Inc/hamilton/blob/5c8e564d19ff23....

To be clear, the specific paradigm we're referring to is this way of writing transforms as functions where the parameter name is the upstream dependency -- not the notion of delayed execution.

I think there are two different concepts here though:

1. How the transforms are executed

2. How the transforms are organized

Hamilton cares about (2) and delegates to Polars/pandas for (1). The problem we're trying to solve is the code getting messy and transforms being poorly documented/hard to own -- Hamilton isn't going to solve the problem of optimizing compute as tooling like polars, pandas, and pyspark can handle that quite well.


Yep, we'd love more feedback on how to make the declarative syntax with Polars more natural with Hamilton so you can get the benefits of unit testing, documentation, visualization, swapping out implementations easily, etc.


Not to take away from anything you’ve done here, you guys have put a lot of great effort into this, but this paradigm is not “new”. It’s a common modeling paradigm in banks and hedge funds at least. I’ve built/worked on frameworks based on exactly this concept at 2 previous firms. Here’s some open source examples of the same concept: pyungo, fn_graph, Loman


Oh this is great! I knew about fn_graph (claimed "new" as we created this before fn_graph, but OS'd afterwards) -- I think we're talking with the author soon. But the others are awesome. I used to work at a hedge fund and I think this way of thinking came pretty naturally to me...


Yep, I heard that when we open sourced Hamilton initially -- it was right around the time there was a "Bank Python" post floating around too. When I chatted with Travis O. at about the same time, he pointed that fact out, but he said something like: "oh cool, you can do column level *and* row level computation. Nice." So I interpreted that some places don't have the flexibility that Hamilton has?

Otherwise yeah, those other libraries use the same concept, but it's interesting to see the very different UXs with them.


As far as I know, Polars inherited this idea from (Py)Spark, where it was intended more or less as a port of SQL. And it's not so different from how ORMs usually look and feel.

I think this design is a local maximum for languages that don't have first-class symbols and/or macros like R and Julia. I like to see convergence in this space.

It's also interesting because this style of API is portable more or less unchanged to just about any programming language, from C# to Idris.


> It's also interesting because this style of API is portable more or less unchanged to just about any programming language, from C# to Idris.

Yep I think a declarative syntax is quite portable and can be reimplemented easily in other languages.

On the portable note, where portable we mean swapping dataframe implementations, it's even conceivable to write "agnostic" logic with Hamilton and then at runtime inject the right "objects" that then do the right thing at runtime. E.g. the following is polars specific:

  col_c = (pl.col('a') + pl.col('b')).alias('c')

I think with Hamilton you could be more agnostic and enable it to run on both Pandas and Polars -- with TYPE here a placeholder to indicate something more generic...

  def col_c(a: TYPE, b: TYPE) -> TYPE:
      return a + b

So at runtime you'd instantiate in your Driver some directive to say whether you're operating on pandas or with polars (or at least that's what I imagine in my head) and the framework would take care of the rest...


genuine question - do Polars and Duckdb overlap in the problem space ?


I'll let someone with more polars & duckdb experience weigh in.

But in short yes. Especially if you take the perspective they're both trying to help you do operations over tabular data, where the result is also something tabular.

Duckdb is "A Modern Modular and Extensible Database System" (https://www.semanticscholar.org/paper/DuckDB-A-Modern-Modula...). So it has a bit more to it than polars, as it has a lot of extensibility, for example, you can give it a pandas dataframe and it'll operate over it, and in some cases, faster than pandas itself.

But otherwise at a high-level, yes you could probably replace one for the other in most instances, but not true for everything.


Congrats on the launch, guys! Hamilton was the first MLOps library that really seemed to fit the challenges we face, because it offered a more granular way to structure our code. Really excited to see what other tools are on the way.


Thank you! Recognize your username -- two of our newer decorators are your design/suggestion :)

https://hamilton.readthedocs.io/en/latest/reference/api-refe...


Yo, congrats again on the launch! Anders from dbt Labs here with a "tough" question for you. Apologies for 1) my response being half-baked, and 2) if I haven't done my homework about Hamilton's features.

coincidentally, my PR to the dbt viewpoint was closed by the docs team as "closed, won't do" [1]

I really like the convention of a data plane (where you describe how the data should be transformed) and a control plane (i.e. the configuration of the DAG: do this before this). In this paradigm, I believe the control plane should be as simple as possible, and perhaps even limited in what can be done with it, with the goal of pushing the user to treat data transformation as paramount. Maybe this is why I fell in love with dbt in the first place: it does exactly this.

"spicy" take: allowing users to write imperative code (e.g. using loops) that dynamically generates DAGs is never a good idea. I say this as someone who personally used to pester framework PMs for this exact feature. While things like task groups (formerly subDAGs) [2] initially appear to be the right answer, I always ended up regretting them. They're a scheduling/orchestration solution to a data transformation problem.

Can y'all speak to how Hamilton views the data and control planes, and how its design philosophy encourages users to use the right tool for the job?

p.s. thanks for humoring my pedantry and merging this! [3]

[1]: https://github.com/dbt-labs/docs.getdbt.com/pull/2390 [2]: http://apache-airflow-docs.s3-website.eu-central-1.amazonaws... [3]: https://github.com/DAGWorks-Inc/hamilton/pull/105


> "spicy" take: allowing users to write imperative code (e.g. using loops) that dynamically generates DAGs is never a good idea.

Said with all love: you're definitely wrong and it's not hard to demonstrate.

  1) some upstream service exports data as a series of chunked files each day, you can't know in advance how many
  2) processing a single file pegs a CPU core for ~an hour
  3) you want a task per chunk so you can process them in parallel
  4) ergo: you NEED dynamic task generation
This is a thing that Airflow famously couldn't do until 2.3 or 2.4. Luigi could always do it. Prefect, Nextflow, and Martian(?) can do it. DBT can't do it, though the concept doesn't exactly apply (the DBT version is more like "I want to generate models at compile time" which is much more debatably sane... I have this problem currently and our solution sucks, but mostly I wish I didn't have the problem!)

I see where you're coming from, I think, but my experience has been the opposite. In any major project, I absolutely need my DAGs to have a dynamic shape. Forcing them to have a static shape just pushes the required dynamism outside the tool, likely into some nasty code-generation step.
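A stdlib-only sketch of the dynamic-task pattern being argued for here (the chunk names and worker count are invented for illustration):

```python
# Discover the chunks at run time, generate one task per chunk, and process
# them in parallel -- the task count is not known in advance.
from concurrent.futures import ThreadPoolExecutor

def discover_chunks():
    # In real life: list the files the upstream service exported today.
    return [f"export/chunk_{i}.csv" for i in range(7)]  # count varies per day

def process(chunk: str) -> str:
    return f"processed {chunk}"  # stand-in for the hour of CPU-bound work

chunks = discover_chunks()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, chunks))

print(len(results))  # 7
```

If tomorrow's export has 70 chunks instead of 7, nothing changes: the task set is derived from the data, which is exactly the commenter's point.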


Interesting... So I'm not sure I agree. While Hamilton does support this kind of thing (and we're likely going to build out more support), the assumption above is that the unit of work at the task-level is natural to map to individual items in your example.

In my experience, you want to decouple the task-level orchestration mechanism from the nature of the data itself. A queuing system with multiple consuming threads/processes or a distributed system like spark is kind of meant for this type of task, and can do it more efficiently. So, why rely on the orchestrator to handle it when you potentially have more sophisticated tooling at hand?

To make it more concrete, say your upstream files change to have 10x smaller chunks, and 10x more files -- does the same orchestration system make sense? Are you going to start polluting the set of tasks?

If you did want to rely on the orchestrator for parallelism, an alternative strategy could be to chunk and assign the files to each task -- e.g. 3 files per task, using a stable assignment method to round-robin them (basically the same as the queuing system). You might not get everything done at quite the same parallel pace, but the DAG would be stable.
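One hedged sketch of such a stable assignment (hash-based rather than strict round-robin; the task count and file names are illustrative):

```python
# Hash each file name onto a fixed number of tasks, so the task set stays
# static no matter how many files show up, and a given file always lands
# on the same task.
import hashlib

N_TASKS = 4

def assign(filename: str) -> int:
    """Stable task index for a file; unaffected by other files appearing."""
    digest = hashlib.md5(filename.encode()).hexdigest()
    return int(digest, 16) % N_TASKS

buckets = {i: [] for i in range(N_TASKS)}
for f in [f"chunk_{i}.csv" for i in range(10)]:
    buckets[assign(f)].append(f)
```

The DAG keeps exactly N_TASKS tasks; only the per-task workload grows with the file count.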

Your case is particularly tricky since the files take so long to process, but it seems to me that this might be better suited for a listening/streaming tool that reacts when new files are added; it can add the semi-processed data to a single location, and your daily orchestration task could read from that and set a high-watermark (basically a streaming architecture).

Anyway, not 100% sure how I feel about this but wanted to present a counter-point...


Oh and one thing I forgot to address above:

> In my experience, you want to decouple the task-level orchestration mechanism from the nature of the data itself.

No, that's a fantasy. The data is EVERYTHING. There is no abstract, platonic solution that works equally well with 100 rows or 100 billion rows.

One of the unique things about Data Engineering that sets it apart from most other specialties is that we deal in bulk. The code we write takes significant wall-clock time to execute. Performance ALWAYS matters. Never forget that!


> There is no abstract, platonic solution that works equally well with 100 rows or 100 billion rows.

Very true! That's one reason navigating the available options is tough, since it's pretty dependent on your problem space.


In short, no, no, and no.

If my DAG tool says "I can't handle this, use Spark" I'm throwing that tool away. This is not an exotic Big Data problem, it is extremely pedestrian and common. It's just not reasonable to punt and say "go adopt some 10-million-line behemoth" as a solution. Spark is cool and all but if you're not using it, adopting it is a massive step. The simple answer to "why rely on the orchestrator to handle it" is that that is literally what the orchestrator is for.

> To make it more concrete, say your upstream files change to have 10x smaller chunks, and 10x more files -- does the same orchestration system make sense? Are you going to start polluting the set of tasks?

This should be handled very smoothly and transparently, and Luigi for example succeeds in doing so. If I tell my framework "generate a task for each input chunk, and run with 4 threads" then it just doesn't matter how many tasks get generated. I routinely run Luigi programs with thousands of tasks generated in this manner. What's funny about this is that your hypothesized scenario actually happened: the system we exported this data from decided one day to radically decrease the chunk size / increase the chunk count, without telling us. I didn't have to change my code at all! In fact I didn't even notice for like a month.

Mapping N chunks onto M tasks breaks catastrophically when you start thinking about resumption. Let's say I have a simple DAG: [invoke export] -> { task per chunk } -> [finalize]. As long as each chunk task has a deterministic name/identity, completion state can be tracked for it. Thus if the enclosing job terminates unexpectedly, we can easily resume from where we left off simply by starting the job again. If there's no correspondence between inputs, tasks, and outputs, you can't do it, not without pushing state tracking into user code, which defeats the purpose of the framework.
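The deterministic-identity idea can be sketched in plain Python (a toy stand-in for Luigi's Task/Target mechanism; the names and marker-file scheme are made up):

```python
import os
import tempfile

def process_chunk(chunk_name, out_dir):
    """Idempotent task: skip work if the deterministic output already exists."""
    target = os.path.join(out_dir, f"{chunk_name}.done")  # identity derived from input
    if os.path.exists(target):   # completion check, like a Luigi Target
        return "skipped"
    # ... real processing would happen here ...
    with open(target, "w") as f:
        f.write("ok")
    return "processed"

out_dir = tempfile.mkdtemp()
first = [process_chunk(c, out_dir) for c in ["chunk_a", "chunk_b"]]
# A "crashed and restarted" run resumes for free -- completed chunks are skipped:
second = [process_chunk(c, out_dir) for c in ["chunk_a", "chunk_b"]]
```

Because identity lives in the World State (the marker file) rather than in the process, resumption costs the user nothing.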

Reasonable people can disagree here. Again, Airflow chose not to support this for a long time. But it's not a coincidence that I decided in 2016 that Airflow was pretty useless to me. Dynamic DAG shape was and is a bare minimum requirement for me, and is one of the first things I test with new tools.

Even if you don't think this is a great use case, or your sweet spot, it's important to realize that it is very, very common, and when your users hit it they will look up how to deal with it. When the answer is "we can't help you here" that's extremely disappointing, and that will motivate users to leave.

Maybe the best overall advice I can give you is "spend some time using Luigi" and understanding its model and what it enables. It is absolutely not a perfect tool (in particular, depending on Tasks rather than Targets is probably a design error), but it is the only one I've truly been successful with. Looked at a certain way, it is radically more powerful than most of its competitors. It's sad that it gets overlooked just because it doesn't have a built-in cron or a fancy web UI.


> Even if you don't think this is a great use case, or your sweet spot, it's important to realize that it is very, very common, and when your users hit it they will look up how to deal with it. When the answer is "we can't help you here" that's extremely disappointing, and that will motivate users to leave.

Hamilton doesn't support that, but it hasn't made a design decision that would prevent it :) Yeah, I'm definitely going to dive deeper into Luigi. Thanks for the rec. Especially wondering now whether it abstracts things as data types, or some other way...

As an aside, you might like checking out redun, the influences section is particularly enlightening as to what other systems there are - https://insitro.github.io/redun/design.html#influences


Great questions!

> "spicy" take: allowing users to write imperative code (e.g. using loops) that dynamically generates DAGs are never a good idea.

Can you give some examples of when it was a bad idea? Otherwise, to clarify: with Hamilton, there is no dynamism at runtime. When the DAG is generated, it's generated and it's fixed. The operator we have for doing this, `@parameterize`, requires everything to be known at DAG construction time. It's really just shorthand for manually writing out all the functions. So I don't think it's quite the same story -- it is more a "power user feature" -- but when used, it makes code DRY-er, at the cost of some readability.
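To make the "just shorthand" point concrete, here's a toy sketch (not actual Hamilton code) of the hand-written expansion that a parameterized function replaces:

```python
# Hand-written expansion: one near-duplicate function per parameterization.
def avg_3wk_spend(spend: list) -> float:
    """Average of the last 3 entries."""
    return sum(spend[-3:]) / 3

def avg_6wk_spend(spend: list) -> float:
    """Average of the last 6 entries."""
    return sum(spend[-6:]) / 6

spend = list(range(1, 13))  # 12 "weeks" of toy spend data
```

With `@parameterize` you'd declare one function plus a mapping of output names to window sizes; both nodes are still generated at DAG-construction time, so nothing about the DAG shape changes at runtime.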

> While things like task groups (formerly subDAGs) [2] appear initially to be right answer, I always ended up regretting them. They're a scheduling/orchestration solution to a data transformation problem

Yep. Hamilton has a concept of `subdag` too. It's really shorthand for "chaining" Hamilton drivers. We take the latter approach (I believe) since it is there to help you more easily reuse parts of your DAG with different parameterizations. Since Hamilton isn't concerned with materialization boundaries, we don't have to make a decision here, so how it impacts scheduling/orchestration can be punted to a later time :)

> Can y'all speak to how Hamilton views the data and control plane,

Hamilton is just a library. There is no DB that needs to be run to use Hamilton; the only state required is your code. At the simplest micro-level, Hamilton sits within a task, e.g. creating features, and replaces the python script you'd run there. At this level, I'd argue data plane vs control plane doesn't really apply, unless you view code as the control plane and where it runs as the data plane... At the macro-level, e.g. a model pipeline pulling data, transforming it, fitting a model, etc., you can logically describe the dataflow with Hamilton without breaking it up into computational tasks a priori. Here Hamilton tries to be agnostic and provide the hooks you need to coordinate your control plane and facilitate operation on your data plane. Note: this is where we see DAGWorks coming in to provide more functionality. E.g. with Hamilton you don't need to decide whether everything runs in a single task on, say, airflow, or in multiple tasks; it's up to you to make that decision. The beauty of this is that, conceptually, changing what is in a task is really just boilerplate, given all the information you have already encoded into your Hamilton DAG.

> and how it's design philosophy encourages users to use the right tool for the job?

With Hamilton, we believe python UDFs are the ultimate user interface. Hamilton forces you to chunk logic and integrations into functions, and provides ways to "decorate" functions, which gives the ability to inject logic around running them. So we're really quite agnostic to the tool, but want to provide the hooks to easily and cleanly add, adjust, and remove tools. For example, to switch between Ray and Dask, our philosophy is that ideally you write code that is agnostic to the implementation, then add those concerns in at runtime. As another example, switching observability vendors should not force a large refactor on your code base; we have an extensible `@check_output` decorator that should constrain how much you "leak" from the underlying tools. In short: (1) write functions that don't leak implementation details -- they should just express logic; (2) the Hamilton framework should have the hooks required for you to plug in "tool" concerns. Does that make sense? Happy to elaborate more.


Hey Stefan and Elijah, I really like the approach you're taking, especially with Hamilton being the open core.

I've got recent experience with data eng / pipeline startups and am wondering if you are hiring for your first engineers at this time.


Thank you! We're not hiring just yet, but feel free to log onto the Hamilton OS slack and send us your resume -- happy to chat about future possibilities or connect you with people that are hiring in a similar space!


Any thoughts on how DAGWorks compared to something like Domino Datalab[1]?

1: https://docs.dominodatalab.com/en/latest/user_guide/bc1c6d/s...


Yeah! So I'm not an expert with domino, but looking at it, it serves a slightly different purpose.

DAGWorks is an opinionated way to write code that allows you to abstract it away from the infrastructure, whereas Domino is more about making it easy to deal with infrastructure, manage datasets, etc., and also has a large notebook/development focus. Nothing in it is built to make it natural for the code to live for a while and stay well-maintained, which is the problem DAGWorks is trying to solve.

The idea is that we can allow you to plug into whatever infrastructure you want and not have to think about it too much when you want to switch, although we're still building pieces of that out.


Congrats Stefan, from someone working at a competitor - always good to see more tools for production ML.


Thanks Josh! I actually think we're pretty complementary, less competitive. For example, it'd be very conceivable for users to use Hamilton, DAGWorks, and Databricks together!


Thanks! Appreciate it. I'm finding this space is massive and there's still more problems managing ML code than there are good solutions for it :) So lots of room for everyone.


As a long-time fan of DAG-oriented tools, congrats on the launch. Maybe you can get added here https://github.com/pditommaso/awesome-pipeline now or in the future...

This is a problem space I've worked in and been thinking about for a very, very long time. I've extensively used Airflow (bad), DBT (good-ish), Luigi (good), drake (abandoned), tested many more, and written two of my own.

It's important to remember that DAG tools exist to solve two primary problems, that arise from one underlying cause. Those problems are 1) getting parallelism and execution ordering automatically (i.e. declaratively) based on the structure of dependencies, and 2) being able to resume a partially-failed run. The underlying cause is: data processing jobs take significant wall-clock time (minutes, hours, even days), so we want to use resources efficiently, and avoid re-computing things.

Any DAG tool that doesn't solve these problems is unlikely to be useful. From your docs, I don't see anything on either of those topics, so not off to a strong start. Perhaps you have that functionality but haven't documented it yet? I can imagine the parallelism piece being there but just not stated, but the "resumption from partial failure" piece needs to be spelled out. Anyway, something to consider.

A couple more things...

It looks like you've gone the route of expressing dependencies only "locally". That is, when I define a computation, I indicate what it depends on there, right next to the definition. DBT and Luigi work this way also. Airflow, by contrast, defines dependencies centrally, as you add task instances to a DAG object. There is no right answer here, only tradeoffs. One thing to be aware of is that when using the "local" style, as a project grows big (glances at 380-model DBT project...), understanding its execution flow at a high level becomes a struggle, and is often only solvable through visualization tools. I see you have Graphviz output which is great. I recommend investing heavily in visualization tooling (DBT's graph browser, for example).

I don't see any mention of development workflow. As a few examples, DBT has rich model selection features that let you run one model, all its ancestors, all its descendants, all models with a tag, etc etc. Luigi lets you invoke any task as a terminal task, using a handy auto-generated CLI. Airflow lets you... run a single task, and that's it. This makes a BIG DIFFERENCE. Developers -- be they scientists or engineers -- will need to run arbitrary subgraphs while they fiddle with stuff, and the easier you make that, the more they will love your tool.

Another thing I notice is that it seems like your model is oriented around flowing data through the program, as arguments / return values (similar to Prefect, and of course Spark). This is fine as far as it goes, but consider that much of what we deal with in data is 1) far too big for this to work and/or 2) processed elsewhere e.g. a SQL query. You should think about, and document, how you handle dependencies that exist in the World State rather than in memory. This intersects with how you model and keep track of task state. Airflow keeps task state in a database. DBT keeps task state in memory. Luigi tracks task state through Targets which typically live in the World State. Again there's no right or wrong here only tradeoffs, but leaning on durable records of task state directly facilitates "resumption from partial failure" as mentioned above.

Best of luck.


Thank you! This is awesome to hear from someone so knowledgable on the space. Really great feedback :)

> It's important to remember that DAG tools exist to solve two primary problems, that arise from one underlying cause. Those problems are 1) getting parallelism and execution ordering automatically (i.e. declaratively) based on the structure of dependencies, and 2) being able to resume a partially-failed run. The underlying cause is: data processing jobs take significant wall-clock time (minutes, hours, even days), so we want to use resources efficiently, and avoid re-computing things.

> Any DAG tool that doesn't solve these problems is unlikely to be useful. From your docs, I don't see anything on either of those topics, so not off to a strong start. Perhaps you have that functionality but haven't documented it yet? I can imagine the parallelism piece being there but just not stated, but the "resumption from partial failure" piece needs to be spelled out. Anyway, something to consider.

Agreed that these are two of the main purposes, but I think we've found "organizing code" to be up there with them. Managing code and transformations, and linking them to data, can be quite a challenge -- this is the main focus of Hamilton. Hamilton has (1) to some extent -- we have ray/dask integrations to help with parallelism, and the capability to extend it significantly (although we haven't found the right use-case to dig in just yet). Re (2), it's something we've been prototyping and have been asked for. Agreed that it is of high value.

> It looks like you've gone the route of expressing dependencies only "locally". That is, when I define a computation, I indicate what it depends on there, right next to the definition. DBT and Luigi work this way also. Airflow, by contrast, defines dependencies centrally, as you add task instances to a DAG object. There is no right answer here, only tradeoffs. One thing to be aware of is that when using the "local" style, as a project grows big (glances at 380-model DBT project...), understanding its execution flow at a high level becomes a struggle, and is often only solvable through visualization tools. I see you have Graphviz output which is great. I recommend investing heavily in visualization tooling (DBT's graph browser, for example).

First, I really like the description of "local" style -- I'd been struggling with how to represent it. But yes, the bigger things get, the uglier visualizing the entire flow becomes, but the easier it gets to figure out how an individual piece works. Our thought is that, as things get uglier, teams are going to want to be able to dig into just their specific focus (module, function, etc...). Re: graphviz, we have some flexible visualization on the OS side, plus some pretty powerful stuff on the closed-source side that we're playing around with. Will look at DBT's graph browser!

> I don't see any mention of development workflow. As a few examples, DBT has rich model selection features that let you run one model, all its ancestors, all its descendants, all models with a tag, etc etc. Luigi lets you invoke any task as a terminal task, using a handy auto-generated CLI. Airflow lets you... run a single task, and that's it. This makes a BIG DIFFERENCE. Developers -- be they scientists or engineers -- will need to run arbitrary subgraphs while they fiddle with stuff, and the easier you make that, the more they will love your tool.

Yeah, definitely worth bringing out -- it's a great point! Right now the interaction is through python code and the driver, but I think we can highlight how to run things in the documentation as well + add some more complex features. We have a notion of "inputs" and "overrides", allowing you to do basically all of that, but it's definitely worth exposing to the user in a friendly way.

> Another thing I notice is that it seems like your model is oriented around flowing data through the program, as arguments / return values (similar to Prefect, and of course Spark). This is fine as far as it goes, but consider that much of what we deal with in data is 1) far too big for this to work and/or 2) processed elsewhere e.g. a SQL query. You should think about, and document, how you handle dependencies that exist in the World State rather than in memory. This intersects with how you model and keep track of task state. Airflow keeps task state in a database. DBT keeps task state in memory. Luigi track task state through Targets which typically live in the World State. Again there's no right or wrong here only tradeoffs, but leaning on durable records of task state directly facilitates "resumption from partial failure" as mentioned above.

Yes, that's a great point. Hamilton is a lightweight library, so there's only so much it should do around holding state. E.g., it won't be the piece orchestrating/triggering itself; that'll be external to it. Hamilton often starts when the data is in some manipulatable form (pandas, polars, or pyspark if the data is too big), and the user wants to run a bunch of transformations on it. That said, a common pattern is something like this:

  def data_stored_in_some_db(con: Connection) -> pd.DataFrame:
      return pd.read_sql("SELECT * FROM MY_TABLE WHERE ...", con=con)
which brings data from the outside world into the Hamilton world -- bridging the gap. Our plan is that we can use Hamilton as the language, allowing the user to choose between different orchestration environments -- this is where DAGWorks comes in. Think of it as kind of a terraform for specifying pipelines/workflows. The idea is that someone would write Hamilton code, then run locally with small data to test, compile to airflow/luigi/metaflow for production, etc... I think this is particularly powerful, as the concerns of the data scientist writing the pipeline are going to be different from those of the platform team -- which should be able to plug in their desired orchestration framework behind the scenes.

> Best of luck.

Thank you! Really appreciate your feedback and thoughts -- this is super valuable. Feel free to reach out on the Hamilton slack or book time with us (on the website) if you want to nerd out about data pipelines more :)


Ah, it seems like you're imagining Hamilton to be used to structure sort of "small" pieces, which might then be orchestrated into a Big Picture thing by another tool, something in the Airflow/Dagster/Argo/Flyte class. Or perhaps a paid service offered by DAGWorks in the future...

One, that's reasonable, and as you say there are code organization and testing benefits. I would emphasize that that's the recommended pattern. I would also work to establish, and document, the details of how folks should go about doing that, and provide solid examples. (BTW your "air quality analysis" example is quite good, being far from trivial yet still example-sized in complexity.)

Two, ehhhhhhh I'm a little skeptical of most teams' ability to factor their projects that well. Folks will want to re-use outputs that are seen as useful, especially if they are expensive to compute. This causes DAG scope to grow and grow and grow. DBT in particular is vulnerable to this, and I have been told of 1000-model projects, which is just yuck. This isn't a problem you have to solve right now, but it's worth thinking about. As a motivating example, what if someone wanted to take the p-value output by the air quality example, and use that as an input into [some other thing]? What would be the "right" way to express that?


> Ah, it seems like you're imagining Hamilton to be used to structure sort of "small" pieces, which might then be orchestrated into a Big Picture thing by another tool, something in the Airflow/Dagster/Argo/Flyte class. Or perhaps a paid service offered by DAGWorks in the future...

Yep. Hamilton is good at modeling the "micro". You can also express the "macro" via Hamilton, and then later determine how to "cut" it up for execution on airflow/dagster/etc.

> One, that's reasonable, and as you say there are code organization and testing benefits. I would emphasize that that's the recommended pattern. I would also work to establish, and document, the details of how folks should go about doing that, and provide solid examples. (BTW your "air quality analysis" example is quite good, being far from trivial yet still example-sized in complexity.)

Yep thanks for the feedback. Documentation is something we're slowly chipping away at; that's good feedback regarding the example. I think I'll take that phrasing "far from trivial yet still example-sized in complexity" as a goal for some examples.

> Two, ehhhhhhh I'm a little skeptical of most teams' ability to factor their projects that well.

Agreed. Though we hope the focus on "naming" and forced "python module curation" help nudge people into better patterns than just appending to that SQL/Pandas script :)

> Folks will want to re-use outputs that are seen as useful, especially if they are expensive to compute. This causes DAG scope to grow and grow and grow. DBT in particular is vulnerable to this, and I have been told of 1000-model projects, which is just yuck. This isn't a problem you have to solve right now, but it's worth thinking about.

Yep. Agreed. I think Hamilton's model scales a bit better than DBT's -- one team at Stitch Fix manages over 4000 feature transforms in a single code base. Some of that I think comes from the fact that you can think in columns, tables, or arbitrary objects with Hamilton, and you have some extra flexibility with materialization (e.g. don't need that column, don't compute it). But as you point out, for expensively computed things, you likely don't want to re-materialize them. To that end, right now, you can get at this manually. E.g. ask Hamilton what's required to compute a result, and if you have it cached/stored, retrieve it and pass it in as an override. We could also do more framework-y things like global caching/connecting with data stores to prevent unneeded re-computation...
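That manual override pattern might look something like this (a hypothetical mini-resolver, not Hamilton's actual API; Hamilton similarly resolves dependencies from parameter names):

```python
def execute(dag, wanted, overrides=None):
    """Tiny resolver: compute each node from its function unless a cached
    value was passed in as an override."""
    overrides = dict(overrides or {})
    def resolve(name):
        if name in overrides:
            return overrides[name]
        fn = dag[name]
        # dependency names come from the function's parameter names
        deps = fn.__code__.co_varnames[:fn.__code__.co_argcount]
        overrides[name] = fn(*(resolve(d) for d in deps))
        return overrides[name]
    return {name: resolve(name) for name in wanted}

# "expensive_model" is cached from a prior run, so its function never runs:
dag = {
    "expensive_model": lambda: 1 / 0,  # would blow up if recomputed
    "prediction": lambda expensive_model: expensive_model * 2,
}
result = execute(dag, ["prediction"], overrides={"expensive_model": 21})
```

The DAG definition stays the same; only how its dependencies get satisfied changes.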

> As a motivating example, what if someone wanted to take the p-value output by the air quality example, and use that as an input into [some other thing]? What would be the "right" way to express that?

The Hamilton way would be to express that dependency as a function in all cases. But, yes, do you recompute, or do you share the result (assuming I understood your point here)? Good question, and it's something we've been thinking about, and would love more design partnership on ;) -- since I think the answer changes a lot depending on the size of the company and the size of the data. There are nice things about not having to share intermediate data, and drawbacks too. I'm bullish, though, that with Hamilton we have the choice to go either way. The Hamilton DAG logically doesn't change; it's really how computation/dependencies are satisfied.

@Elijah anything I missed?


Think you got it! Re factoring projects well: it's interesting, but I think there are some good strategies here. What's worked for us is working backwards -- starting with the artifact you want and progressively defining how you get there until you reach the data you need to load.

Thanks btw for all the feedback! This is great.


would love to collaborate on an integration with pyquokka (https://github.com/marsupialtail/quokka) once I put out a stable release end of this month :-)


Would love! Yeah I think how we did the PySpark Map UDF support should enable us to do something similar with Quokka.


Can this be set up to yield data from individual functions instead of simply returning it?


Thanks for the question! Clarification -- by "yield" do you mean "return a single function's result" or "use a function that's a generator"?

In the former case, yes, it's pretty easy:

  driver.execute(final_vars=["my_func_1", "my_func_2"]) 
will just give out the specific ones you want. In the latter case, it's a little trickier but doable -- we were just going over this with a user recently actually! https://github.com/DAGWorks-Inc/hamilton/issues/90


How can I convince someone to try this, if they are comparing this solution with dbt? Can only pick one :)


First, it's worth jumping on tryhamilton.dev -- we built it to make getting started easy.

And, as always, it depends on your use case! If it’s all analytics and DWH manipulations — that’s not Hamilton’s expertise. If you want to do any DS work (model training, feature engineering, fine-grained lineage, etc…) and need to work in python, Hamilton is a great choice!

Otherwise, Hamilton is just a library -- we do have an example of integrating with dbt:

https://towardsdatascience.com/hamilton-dbt-in-5-minutes-62e...


I honestly prefer the first approach in the example given.


love the transparency this brings. Any 3rd party tools with plans to integrate with it? (e.g. analytics layer companies?)


Thanks! Lots of plans in the works. Two directions we're moving:

1. Integrate with orchestration systems (e.g. run Hamilton pipelines on different orchestration platforms). You could imagine compiling to airflow pipelines, running Hamilton on metaflow, compiling to a vertex pipeline, etc...

2. Adapters to load data from/save data to external providers. E.g. logging data quality to whylogs, loading data up from snowflake, saving a model to mlflow...

We're actively working on building this out though -- we have some partners who are helping build out use-cases!


Ola, I'm the founder of windmill [1], which is an OSS orchestration platform for scripts, including python, and I'm pretty sure one could compile a Hamilton DAG to a windmill openflow very simply (we use a mounted folder in `./shared` to share heavy data across steps). Right now, one can do ETL using polars on windmill, but I'd love to have an even more structured code framework to recommend.

We should chat!

[1]: https://github.com/windmill-labs/windmill


Nice! Yep that's part of our vision to "compile onto", i.e. generate code for, frameworks such as yours.

Happy to sketch something out - want to join our slack to chat https://join.slack.com/t/hamilton-opensource/shared_invite/z...?


> Most companies running ML in production need a ratio of 1:1 or 1:2 data scientists to engineers. At bigger companies like Stitch Fix, the ratio is more like 1:10 — way more efficient

Did you write these wrong way round, maybe? Or are you saying a ratio of 1 data scientist to 10 engineers is efficient?


Good catch—thanks! Fixed above.


Amazing.


ah yep -- good catch -- d'oh -- and it's too late to update.

But yes, the data scientists vs engineers should be flipped.



