
Modeling Science as a Directed Graph - ftong
https://benchling.engineering/modeling-science-as-a-directed-graph-bfc149d4502b
======
markhollis
There's actually a subfield called 'scientific workflow systems' (SWS); an
example of such a system is Kepler [1]. What the authors ask for at the end of
the article is essentially process provenance, which is a fairly well-studied
problem.

[1]:
[https://en.wikipedia.org/wiki/Kepler_scientific_workflow_system](https://en.wikipedia.org/wiki/Kepler_scientific_workflow_system)

~~~
myxie
+1 to this comment. It is a reasonably sized body of research, too. On top of
this, scientific workflow scheduling maps to the classic scheduling problem,
which is known to be NP-hard; the fact that the authors imply that automating
the traversal is all that is necessary suggests that they have an incomplete
understanding of the problem at hand.

For those interested, a nice list of workflow managers, engines, and even
languages can be found at
[https://github.com/pditommaso/awesome-pipeline](https://github.com/pditommaso/awesome-pipeline).
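
To make the distinction concrete (a toy sketch, not from the article): topologically traversing a DAG takes linear time, but scheduling the same DAG onto a limited pool of workers to minimize total completion time is NP-hard, so real systems fall back on heuristics such as greedy list scheduling. All names here are illustrative:

```python
import heapq

def topological_order(tasks, deps):
    """Kahn's algorithm: traversal alone is easy (linear time)."""
    indeg = {t: 0 for t in tasks}
    children = {t: [] for t in tasks}
    for task, parent in deps:  # (task, parent): task depends on parent
        indeg[task] += 1
        children[parent].append(task)
    ready = [t for t in tasks if indeg[t] == 0]
    order = []
    while ready:
        t = ready.pop()
        order.append(t)
        for c in children[t]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    if len(order) != len(tasks):
        raise ValueError("cycle detected")
    return order

def list_schedule(tasks, deps, duration, n_workers):
    """Greedy list scheduling: a heuristic, because minimizing makespan
    under precedence constraints with limited workers is NP-hard."""
    parents = {t: set() for t in tasks}
    for task, parent in deps:
        parents[task].add(parent)
    finish = {}
    workers = [0.0] * n_workers  # heap of times at which each worker frees up
    heapq.heapify(workers)
    for t in topological_order(tasks, deps):
        free_at = heapq.heappop(workers)
        # a task starts once a worker is free AND all its parents finished
        start = max([free_at] + [finish[p] for p in parents[t]])
        finish[t] = start + duration[t]
        heapq.heappush(workers, finish[t])
    return max(finish.values())  # makespan of this (possibly suboptimal) schedule
```

The point is that `topological_order` answers "in what order can steps run?" cheaply, while `list_schedule` only approximates "how should steps be packed onto scarce instruments/people?", which is the genuinely hard part.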

------
DonbunEf7
Reminds me of ologs:
[https://arxiv.org/abs/1102.1889v2](https://arxiv.org/abs/1102.1889v2)

The category on top of a given graph just adds path equivalences. While that
might not seem like a big addition, it's crucial for restricting the
structure's internal state.
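
As a toy illustration of what a path equivalence buys you (the edge names are made up, not from the olog paper): morphisms in the free category on a graph are composites of edges, and declaring equations between paths lets you decide whether two composites denote the same thing by rewriting both to a normal form:

```python
def normalize(path, equations, max_steps=100):
    """Rewrite a path (a tuple of edge labels) using declared path
    equations, each (lhs, rhs) meaning 'the path lhs equals the path rhs'.
    A toy sketch: it assumes the rewrite system is confluent and
    terminating, which a real system would have to check."""
    path = list(path)
    changed, steps = True, 0
    while changed and steps < max_steps:
        changed = False
        steps += 1
        for lhs, rhs in equations:
            lhs = list(lhs)
            for i in range(len(path) - len(lhs) + 1):
                if path[i:i + len(lhs)] == lhs:
                    path[i:i + len(lhs)] = list(rhs)  # apply the equation
                    changed = True
                    break
            if changed:
                break
    return tuple(path)
```

Two paths are then equivalent exactly when they normalize to the same tuple, e.g. with the equation "father then wife = mother", the paths `("father", "wife", "name")` and `("mother", "name")` coincide.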

~~~
mncharity
Thanks. In a similar vein, there's the recent _Knowledge Representation in
Bicategories of Relations_:
[https://arxiv.org/pdf/1706.00526.pdf](https://arxiv.org/pdf/1706.00526.pdf)

------
mncharity
It seems planning of "opportunistic" software projects faces related
challenges? Though it's more like research-direction exploration and planning
than research production-process management.

Gantt is ok when simple time and resource constraints dominate, and there's a
simple path. But not all "project-design spaces" collapse that simply.

Even small opportunistic projects can have painfully large dependency graphs.
"If/when browser bug X is ever fixed", "if I find a better algorithm", "if we
get hardware X", "if/when these conditions are ever met, budget this amount of
work along these vectors", "we could go this way or that", "here are some
associated risks", "possible risk mitigations/explorations", "X is a candidate
goal state", "with some path dependencies", "and can pay for this pattern of
costs". "Y is an alternative candidate goal state", "this area of project
space has a bunch of scattered payoffs", "sweeping this way is sparse on
payoffs until the risky big one", ... and on, and on.

I've never seen tooling that wasn't ghastly painful for sketching out stuff
like this, editing it, and keeping it up to date. I have hopes of building
something in VR. But perhaps I missed something?

------
hyperion2010
One problem with modeling workflows as DAGs is coming up with reasonable
ways to express common control flow elements. That said, for the parts of
processes that are dependency trees, DAGs are the right fit.

~~~
alangpierce
(I work on the Benchling workflows product.)

What kinds of control flow elements do you have in mind?

I know one we've run into a few times is "repeat this step as many times as
necessary", like the repeated stirring example in the article. There's some
nuance here because there are actually two graphs backing a workflow: the
"configuration" graph that describes the experimental design, and the
"execution" graph that describes what actually happened. In this case, the
configuration graph would have a single node with a self-loop for stirring,
and the execution graph would have a variable number of nodes, one after the
other, based on how many times the stirring needed to happen.

The configuration graph might have cycles or more abstract control flow
mechanisms, but I think the execution graph is always a DAG because it's
describing what happened in the real world. Configuration graphs are useful at
the design step, but the execution graph is probably more useful for
coordinating the experiment and analyzing how it went. We've found that in
practice, scientists might need to make unexpected ad-hoc changes a week into
a month-long experiment, so trying to capture everything up-front in the
configuration graph is sometimes not reasonable in the first place.
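
A minimal sketch of that unrolling step (the names and data layout are invented for illustration, not Benchling's actual representation): a self-loop in the configuration graph becomes a chain of execution nodes once you know how many repetitions actually happened, so the execution graph stays acyclic:

```python
def unroll(config_edges, repeat_counts):
    """Expand a configuration graph (which may contain self-loops meaning
    'repeat as needed') into an execution DAG of what actually happened.
    repeat_counts maps a config node to how many times it was performed."""
    exec_edges = []
    expansion = {}  # config node -> its list of execution-node instances
    nodes = {n for edge in config_edges for n in edge}
    for node in sorted(nodes):
        count = repeat_counts.get(node, 1)
        instances = [f"{node}#{i}" for i in range(1, count + 1)]
        expansion[node] = instances
        # chain repeated instances one after the other
        exec_edges += list(zip(instances, instances[1:]))
    for src, dst in config_edges:
        if src == dst:
            continue  # the self-loop was already expanded into a chain
        exec_edges.append((expansion[src][-1], expansion[dst][0]))
    return exec_edges
```

For example, a `prep -> stir -> analyze` design with a self-loop on `stir` and three actual stirs unrolls into `prep#1 -> stir#1 -> stir#2 -> stir#3 -> analyze#1`.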

One aspect around control flow that makes it a little easier for us at the
moment is that it's not DAGs all the way down. Each graph node corresponds to
a lab notebook entry of instructions performed by a scientist, so any complex
details around operating instruments and collecting results are just
represented as human-readable instructions in a reusable template. For higher-
level handoffs between scientists and teams, though, we've found that the DAG
model works out pretty well.

~~~
fabian2k
I'm a bit confused about what the nodes are. You're saying they are the
individual lab notebook entries, but then I wonder where the results of the
experiments are in this graph?

I would have put the instructions on what was done on the edges of the graph,
and the results or measurements on the vertices. But I'm not sure I'm
correctly understanding your model.

~~~
alangpierce
Heh, there are a lot of ways to model it; we've iterated on the details
quite a bit and I imagine we'll iterate more in the future. Here's a quick
explanation:

In our model, a node is a "run", which is an experimental procedure performed
on a sample (or sometimes multiple samples). Each run has input samples,
output samples, and structured data results. A notebook entry can have
multiple runs associated with it, since a scientist will likely be processing
many runs at once, and every run lives in exactly one notebook entry.

A common use case is to take some input sample, perform some transformation on
it, and produce a new output sample that will be used downstream. Another
common use case is a "screening" step where you take a collection of samples
(one run each), run some analyses on those samples, and discard the ones that
don't meet certain criteria. So sometimes the output is a new sample,
sometimes it's the same as the input, and sometimes there's no output.

An edge indicates that the output of one run will be automatically fed into
the input of the next run, and generally shows the flow of physical samples
through the different stages.
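
Roughly, that model could be sketched like this (the field names are my guesses from the description above, not Benchling's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    """A node: an experimental procedure performed on one or more samples."""
    name: str
    inputs: list                                  # input sample ids
    outputs: list                                 # output sample ids (may be
                                                  # empty, e.g. after screening)
    results: dict = field(default_factory=dict)   # structured data results
    notebook_entry: str = ""                      # every run lives in exactly
                                                  # one notebook entry

def sample_flow_edges(runs):
    """An edge means one run's output sample feeds another run's input,
    i.e. the flow of physical samples through the stages."""
    return [(up.name, down.name)
            for up in runs for down in runs
            if up is not down and set(up.outputs) & set(down.inputs)]
```

So a "transform" run that turns sample `S1` into `S2`, followed by a "screen" run that consumes `S2` and produces no output sample, yields a single edge from transform to screen.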

------
amelius
What kind of (open source) tools already exist to read through papers/excel
files, etc. and extract meaning, in the form of a graph?

