Hacker News
Snakemake – A framework for reproducible data analysis (snakemake.github.io)
174 points by gjvc 10 months ago | 48 comments



In the (unusually long) introduction of our paper on SciPipe, we did a fairly thorough overview of the pros and cons of the leading similar tools, including Snakemake, since we tried most of them out before realizing they didn't, at the time, solve our problems:

https://academic.oup.com/gigascience/article/8/5/giz044/5480...

(A little background, FWIW: we had extra tough requirements, as we were implementing the whole machine learning process in the pipeline, so we needed to combine parameter sweeps for hyperparameter optimization with cross-validation, and none of the tools, at least at the time, met our needs for dynamic scheduling combined with fully modular components.

Snakemake and similar tools are, in our experience, great for situations where you have a defined set of outputs you want to be able to reproduce easily (think figures for an analysis), but they become harder to reason about when the workflow is highly dynamic and the desired output files are hard to express in terms of file name patterns.

Nextflow has since implemented the modular components part that we missed (and implemented in SciPipe), but we are still happy with SciPipe, as it provides things like an audit log per output file, workflows compilable to binaries, and great debugging (because it's plain Go), all with a very small codebase and no external Go dependencies.)


You might like mandala (https://github.com/amakelov/mandala) - it is not a build recipe tool; rather, it tracks the history of how your builds / computational graph has changed, and ties it to what the data looked like at each step.


That's a cool approach, thanks for sharing!


Shameless plug for a project I'm somewhat involved in: Hydra-genetics provides a growing set of well structured snakemake modules for bioinformatics (NGS) workflows.

https://github.com/hydra-genetics/


FYI the link to the "installation page" in scipipe's repo's README is 404.


Ouch, thanks, fixed!


Snakemake is a beautiful project that evolves and improves remarkably fast. Years ago I realized I needed to up my game from the usual bash-based NGS data processing pipelines I was writing. Based on several recommendations I chose Snakemake, and I have never regretted it. It worked perfectly on our PBS cluster, then on our Slurm cluster. I took some steps to make it run on K8s, which it supports, and most recently I'm still/again happy with my choice, because Snakemake (together with Nextflow) seems to be the chosen framework for the GA4GH cloud work stream's "products" like WES and TES [0]. This also seems to be the tech stack that Amazon Omics and Microsoft Genomics focus on [1]. It enables many cool things, like "data visiting": just submit your Snakefile (the definition of a workflow, basically a DAG) to a WES API in the same data center where your data lives, and the analysis starts, near the data. Brilliant.

I owe a lot to Snakemake and Johannes Köster, I hope some day I can repay him and his project.

[0] https://www.ga4gh.org/work_stream/cloud/

[1] https://github.com/Microsoft/tes-azure


I too owe a lot of my PhD and postdoc productivity to Snakemake. It's my bioinformatics super-power, allowing me to run a complex analysis, including downloading containers (Singularity/Apptainer) and other dependencies (conda), with one command.

Great for reproducibility. Great for development. Great for scaling analyses.

Snakemake is vital infrastructure for my work.


It's fantastic, but it doesn't scale laterally particularly well compared to plain Make.


What dimension are you referring to?


For one, large-scale reproducibility was a problem a few years back. Conda and containers were a constant problem for us back then, especially with multiple NGS tools running in different environments. This has probably been solved by now, but we went with another workflow system.


Agreed, Conda has always been a nightmare to maintain and redeploy, whatever you put it in.


Nextflow seems to scale very well


We went with Nextflow and Galaxy


You can't go wrong with Nextflow. I've heard a lot of scientists complain that it's too hard to understand, but honestly the DSL and the (flow-based) scheduling model are just great.


100% agree, and it's wonderful to see Snakemake on the top of HN.

Snakemake is an invaluable tool in bioinformatics analysis. It's a testament to Johannes' talent and dedication that, even with the relatively limited resources of an academic developer, Snakemake has remained broadly useful and popular.

Super nice guy too, he's always been remarkably responsive and helpful. I saw him present on Snakemake back when he was a postdoc, and it really changed my approach to pipeline development.


Snakemake is great, but it does feel like just a slightly more modern Make.

I am pretty excited about research projects that tie the recipe and the computation closer together so that you do not preserve just the last recipe, but the whole history of exploratory computation and analysis.

E.g. mandala (https://github.com/amakelov/mandala), a project of a colleague of mine which is basically semantic git for your computational graph and data at the same time.


I work with Snakemake for computational biology. I see a lot of confusion as to why Snakemake exists when workflow management tools like Airflow exist, which mirrors my sentiment when moving from normal software to bio software.

Snakemake is used mostly by researchers who write code, not software engineers. Their alternative is writing scripts in bash, Python, or R; Snakemake is an easy-to-learn way to convert those scripts into a reproducible pipeline that others can use. It's popular in bioinformatics.
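As a sketch of what that conversion can look like (the rule, script path, and filenames here are hypothetical, just to illustrate the shape of a Snakefile rule): an existing script becomes a rule with declared inputs and outputs, and Snakemake works out when it needs to run.

```
rule align:
    input:
        "data/{sample}.fastq"
    output:
        "aligned/{sample}.bam"
    shell:
        "bash scripts/align.sh {input} {output}"
```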

Snakemake can also execute remotely on a shared cluster or in the cloud. It has built-in support for common executors like SLURM, AWS, and TES[1].

Snakemake isn't perfect, but it helps researchers jump from "scripts that only work on their laptop" to "reproducible pipelines using containers" that run easily on clusters and cloud computing. Running these pipelines is still pretty quirky[2], but it's better than the alternative of unmaintained and untested scripts.

There are other workflow managers further down the path of a domain-specific language, like Nextflow, WDL, or CWL. Nextflow is a dialect of Java/Groovy that is notoriously difficult for researchers to learn. Snakemake, in comparison, is built on Python and has a gentler learning curve and fewer quirks.

There are other Python based workflow managers like Prefect, Metaflow, Dagster, and Redun. They're great for software engineers, but don't bridge the gap as well with researchers-who-write-code.

[1] TES is an open standard for workflow task execution that's usable with most bioinformatics workflow managers, like HTML for browsers.

[2] I'm trying to fix this (flowdeploy.com), as are others (e.g. nf-tower). I think the quirkiness will fade over time as tooling gets better.


I don't get why you claim something like Airflow doesn't bridge the gap well with researchers who write code. I've worked with WDL extensively, and I still think Airflow is a superior tool. The second I need any sort of branching logic in my pipeline, the ways of solving it feel like working against the tool, not with it.


The bioinformatics workflow managers are designed around the quirkiness of bioinformatics, and they remove a lot of boilerplate. That makes them easier to grok for someone who doesn't have a strong programming background, at the cost of some flexibility.

Some features that bridge the gap:

1. Command-line tools are often used in steps of a bioinformatics pipeline. The workflow managers expect this and make them easier to use (e.g. https://github.com/snakemake/snakemake-wrappers).

2. Using file I/O to explicitly construct a DAG is built-in, which seems easier to understand for researchers than constructing DAGs from functions.

3. Built-in support for executing on a cluster through something like SLURM.

4. Running "hacky" shell or R scripts in steps of the pipeline is well-supported. As an aside, it's surprising how often a mis-implemented subprocess.run() or os.system() call causes issues.

5. There's a strong community building open-source bioinformatics pipelines for each workflow manager (e.g. nf-core, warp, snakemake workflows).
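On point 4 above, here is a minimal Python sketch (not taken from any of these workflow managers) of the pitfall: by default, `subprocess.run()` does not raise when the tool it invokes fails, so a broken step can go unnoticed mid-pipeline.

```python
import subprocess

# Pitfall: subprocess.run() ignores a non-zero exit code by default,
# so a failing tool in the middle of a pipeline can pass silently.
result = subprocess.run(["false"])  # exits with code 1, no exception
assert result.returncode == 1       # failure is only visible if you check

# With check=True the failure raises immediately instead.
failed = False
try:
    subprocess.run(["false"], check=True)
except subprocess.CalledProcessError:
    failed = True
```

The bioinformatics workflow managers bake this kind of checking (plus logging and retries) into every shell step, which is part of what they remove from the researcher's plate.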

Airflow – and the other less domain-specific workflow managers – are arguably better for people with a stronger software engineering background. For someone who moved from wet lab to dry lab and is learning to code on the side, I think the bioinformatics workflow managers lower the barrier to entry.


> are arguably better for people who have a stronger software engineering basis

As someone who is a software developer in the bioinformatics space (as opposed to the other way around) and who has spent over 10 years deep in the weeds of both the bioinformatics workflow engines and more standard ones like Airflow - I would still reach for a bioinfx engine in that domain.

But - what I find most exciting is a newer class of workflow tools coming out that appear to bridge the gap, e.g. Dagster. From observation it seems like a case of parallel evolution out of the ML/etc world, where the research side of the house has similar needs. Either way, I could see this space pulling eyeballs away from the traditional bioinformatics workflow world.


The problem with Airflow is that each step of the DAG for a bioinformatics workflow is generally going to be running a command line tool. And it'll expect files to have been staged in and living in the exact right spot. And it'll expect files to have been staged out from the exact right spot.

This can all be done with Airflow, but the bioinformatics workflow engines understand that this is a first class use case for these users, and make it simpler.


Awesome to see Snakemake discussed here, and many thanks for the positive feedback below. Thanks to the awesome and very active community, Snakemake has evolved very fast in recent years while being widely used (on average >10 new citations per week in 2022). The best overview of the main features and capabilities can be found in our rolling paper, which is updated from time to time as important new features become available. It also contains various design patterns that are helpful when designing more complex workflows with Snakemake. You can find the paper here: https://doi.org/10.12688/f1000research.29032.2

And here is a little spoiler: I am currently working together with Vanessa Sochat on implementing a comprehensive plugin system for Snakemake, which will be used for executor backends but soon also for Snakemake's remote filesystem, logging, and scripting language support (and maybe more). The current implementations of those things will be moved to plugins but still maintained by the core team. This approach will enable a further democratization of Snakemake's functionalities because people can easily add new plugins without the need to integrate them into the main codebase. The plugin APIs will be kept particularly stable, so that Snakemake updates will in most cases not require any changes in the plugin implementations.


I love Snakemake. It almost saved my PhD, but then I found Nextflow, which suited my type of problems better. What is slightly off-putting about Snakemake is that the internal API isn't well documented; I wanted to contribute a new remote for SQL databases and had to figure out most of the methods by comparing with other examples. Anyway, my PR has been inactive for months, which surprised me, since they usually review and approve quickly.


I see it can be used to define and run workflows. But for reproducibility beyond the execution of operations on data, is there any way to version the underlying data?

Or would you typically use this -in addition- to a tool like dvc? I've used dvc a bit and while it's quite good for data versioning, I find the workflow aspect is clunky.


The way a workflow tool like Snakemake can help here is generally by letting the filenames pretty much describe how each particular output was created, meaning data outputs can act as immutable in a sense.

What I mean is that rather than creating a new version of a file, running the same analysis with a different set of parameters should generate a new file with a different name, not a new version of the old one. This also makes it easier to compare outputs produced with different parameters.
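A minimal sketch of that idea (the helper function and naming scheme here are hypothetical, just to illustrate): bake the run's parameters into the output filename, so each parameter combination yields its own self-describing, effectively immutable file.

```python
# Hypothetical helper: encode analysis parameters into the output path,
# so different parameter sets never overwrite each other's results.
def output_path(sample: str, k: int, normalization: str) -> str:
    return f"results/{sample}.k{k}.norm-{normalization}.txt"

# Two runs with different k produce two distinct files,
# not two "versions" of one file.
paths = [output_path("sampleA", k, "tpm") for k in (21, 31)]
```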

That said, there are workflow platforms which support data versioning, such as Pachyderm (https://pachyderm.com), but it is a bit more heavyweight as it runs on top of Kubernetes.


The reliance on filenames to define (parametric) dependencies was among the reasons I later adopted Nextflow; its model fit my type of computational dependencies better. In the meantime Snakemake has grown, and many DAGs that were hard to describe back then can now be expressed directly with Snakemake primitives.


I am very fond of https://zenodo.org, especially for small datasets for scientific publications.


There is indeed an old issue (https://github.com/snakemake/snakemake/issues/2077) asking whether Snakemake could return a provenance graph using the PROV ontology in JSON format. I think it's a good task to work on if you'd like to contribute.


I use snakemake along with dvc. It detects changes in files, and reruns steps to produce downstream files as needed.


Pachyderm has data versioning / data version control built-in. I guess other tools do too.


This is a fantastic project. It's worth noting that Snakemake is an extension of Python, meaning you can directly incorporate Python code into your Snakefiles.

In our team, we utilize Snakemake along with Singularity. The key operations, such as model training and inference that aren't straightforward shell commands, are compartmentalized using containers. Snakemake significantly simplifies the process of integrating these various modules.


When I was doing my bioinformatics masters I learned how to make all my pipelines in Snakemake, so I never learned how to set up pipelines by chaining bash scripts. I was confused when I encountered pipelines that weren’t written in snakemake or nextflow.


How does this differ from something like Airflow ?


One way to contrast these in broad terms: tools like Airflow are generally much more focused on continually running well-defined production workflows, while tools like Snakemake are mostly focused on creating a reproducible pipeline for one particular analysis, or on making exploratory analysis easy, e.g. exploring the effect of different input data and parameters.

One way this focus shows is that e.g. Snakemake is much more focused on naming files so that you can figure out what input data and parameters were used to produce them, making it easier to compare results from variations on the main analysis.


If you are interested in knowing more about the differences of a pull-style workflow engine like Snakemake which is geared towards Bioinfo problems vs a push-style workflow engine which is geared towards data engineering, you might find our write up helpful: https://insitro.github.io/redun/design.html#influences

There are other important dimensions on which workflow engines differ, such as reactivity, file staging, and dynamism.


There are a lot of differences.

By design, Airflow needs a centralized server/daemon, whereas Snakemake is just a command line tool, like make/cmake. This can become an issue in HPC environments.

In Airflow the workflow is assumed to be executed repeatedly, whereas in Snakemake it is usually run once (like you only recompile a program when source files change).

Airflow has a queuing system, whereas Snakemake is meant to be used in conjunction with other task management systems.

In Airflow, shell scripts always feel like second-class citizens, whereas Snakemake has good support for shell scripting, which enables easy integration with tools written in other programming languages.


yaml


> With Snakemake, data analysis workflows are defined via an easy to read, adaptable, yet powerful specification language on top of Python. Each rule describes a step in an analysis defining how to obtain output files from input files. Dependencies between rules are determined automatically.

This appears to be a step backwards. Over a long career, I've seen the benefits of functional programming and dataflow languages for data processing. Sure, there is a time and place for intermediate files, but they are not a wise choice for a fundamental compositional principle.

Update: I can see why Snakemake is preferable over manual file processing pipelines. How can improvement be bad? Well, decisions are relative. Why choose X framework if there are better designs available? I think it is possible to solve more than one problem at once -- I don't see why the whole framework should be built around files as intermediate storage locations. Generally speaking, it is often possible to design a framework that solves incremental problems while also leapfrogging antiquated design choices.


It's really nice, but I think it would have been better if it did not mix YAML and Python source and instead provided an API for writing workflows completely in Python. All the YAML + Python mix gets you is that tools that work with either YAML or Python don't work well with it.


For a very different approach, check out make-booster:

https://github.com/david-a-wheeler/make-booster

Make-booster provides utility routines intended to greatly simplify data processing (particularly a data pipeline) using GNU make. It includes some mechanisms specifically to help Python, as well as general-purpose mechanisms that can be useful in any system. In particular, it helps reliably reproduce results, and it automatically determines what needs to run and runs only that (producing a significant speedup in most cases). Released as open source software.


Snakemake seems really interesting (in a good way). I was going to say I wasn't aware of it but then in watching the Intro video I realized I had seen it before.

However, I find myself scratching my head about when I would use it, or what need it's fulfilling.

It seems like most of what it brings to the table should be directly inferrable from the analysis and presentation code and comments. If you're working with multiple systems, isn't this what scripting languages were originally for?

It seems like another layer of complexity on top of what's already there, which ... just adds complexity.

Or is the idea to make a data analysis-specific scripting DSL?


> However, I find myself scratching my head about when I would use it, or what need it's fulfilling.

When you are doing Next Generation Sequencing data analysis, it is basically THE tool. Snakemake comes with everything I ever needed: since it uses conda to install workflow dependencies, tools like BWA, fastqc, multiqc etc. are all present and fresh. I feel that Snakemake was made for NGS data analysis.

Many Snakemake users, like me, are biologists who know some Python and bash; I have never used Make, for example. The people pointing out weaknesses here seem to have complex, CS-related use cases; for bioinformaticians in the NGS field it's a godsend.
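As a sketch of how the conda integration looks in a rule (the rule name, file paths, and environment file here are hypothetical; fastqc is a real QC tool): the rule declares its own environment file, and Snakemake creates that environment on first run when invoked with --use-conda.

```
rule fastqc:
    input:
        "reads/{sample}.fastq.gz"
    output:
        "qc/{sample}_fastqc.html"
    conda:
        "envs/qc.yaml"  # pins fastqc and its dependencies
    shell:
        "fastqc {input} --outdir qc/"
```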


Yes, it's a DSL.

Here's a simple scatter-gather example. Let's say you want to count the number of lines in each file for a list of samples, and report a table of counts collected from each sample. Define a rule to process each input file, and a rule to collect the results.

I find this much less complex than an equivalent bash workflow. Additionally, these rules can be easily containerized, the workflow can be parallelized, and the workflow is robust to interruption and the addition of new samples. Snakemake manages checking for existing files and running rules as necessary to create missing files, logic that is much more finicky to implement by hand in bash.

    with open('data/samples.txt') as slist:
        SAMPLES = [l.strip() for l in slist.readlines()]
    
    rule all:
        input:
            "results/line_counts.txt"
    
    rule count_lines:
        input:
            "data/lines/{sample}.txt"
        output:
            "processed/count_lines/{sample}.txt"
        shell:
            """
            cat {input} |
              wc -l | 
              paste <(echo -e {wildcards.sample}) - > {output}
            """
    
    rule collect_counts:
        input:
            expand("processed/count_lines/{sample}.txt", sample=SAMPLES)
        output:
            "results/line_counts.txt"
        shell:
            """
            cat <(echo -e "sample\tn_lines") {input} > {output}
            """


This looks like... an unconstrained amalgamation of YAML, Python, and zsh/bash. Knowing all these building blocks quite well, I can't say I find this attractive at all. It seems bound to incur all the problems that putting everything in YAML cursed configuration management and deployment orchestration with in Ansible.


Super cool! Would love to see an integration with Oxen and their data version control https://github.com/Oxen-AI/oxen-release


Is Snakemake more reproducible than Python source code with pinned dependencies?


I use Nextflow, but I assume Snakemake has similar functionality.

In short, writing Python code to handle often 10+ separate steps with external binaries, robust enough to restart the pipeline when, say, 15 out of 100 files failed at step 4 and another 10 didn't pass step 7, is complicated. You need far less, and easier to understand, code with Nextflow.
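A minimal Python sketch (hypothetical helper, not from Nextflow or Snakemake) of the core bookkeeping these tools automate: a step is rerun only when its output is missing or older than its input, which is what makes restarting a half-finished pipeline cheap.

```python
import os

# Sketch of make-style staleness checking, the logic workflow managers
# apply per rule: rerun a step only if its output is absent or stale.
def needs_run(input_path: str, output_path: str) -> bool:
    if not os.path.exists(output_path):
        return True  # never produced, or lost in a failed run
    # Rerun if the input changed after the output was last written.
    return os.path.getmtime(output_path) < os.path.getmtime(input_path)
```

Multiply this by dozens of steps, partial failures, and retries, and the appeal of having the workflow engine own it becomes clear.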



