
Dgsh – Directed graph shell - nerdlogic
http://www.dmst.aueb.gr/dds/sw/dgsh/
======
chubot
This looks pretty interesting, although I'll have to dig more into the
examples to see why they chose this set of primitives (multipipes, multipipe
blocks, and stored values).

Here is a 2009 paper, "Composing and executing parallel data-flow graphs with
shell pipes", which is also a bash extension. (I'm impressed with anyone who
successfully enhances bash's source code.)

It has a completely different model, though, one that I think is more
suitable for "big data".

[https://scholar.google.com/scholar?cluster=98697598478714306...](https://scholar.google.com/scholar?cluster=9869759847871430654&hl=en&as_sdt=0,5&sciodt=0,5)

[http://dl.acm.org/citation.cfm?id=1645175](http://dl.acm.org/citation.cfm?id=1645175)

 _In this paper we extend the concept of shell pipes to incorporate forks,
joins, cycles, and key-value aggregation._

I have a printout of this paper, but unfortunately it doesn't appear to be
online :-(

------
xiaq
I've always thought about integrating this functionality into elvish
[https://github.com/elves/elvish](https://github.com/elves/elvish) but cannot
come up with a good syntax. dgsh has a good one, but unfortunately its use of
& breaks that operator's traditional semantics. Does anyone have an idea for a
tradition-compatible grammar?
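
For concreteness, here is roughly what the clash looks like (I'm
approximating the multipipe-block syntax from the examples on the dgsh site):

    # dgsh (approximate syntax): inside a {{ ... }} multipipe block,
    # & terminates each parallel branch instead of backgrounding a job.
    ls -l |
    {{
        awk '{s += $5} END {print s}' &
        wc -l &
    }} |
    paste

Reusing that grammar verbatim in elvish would change what & means there too.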

Also, to nitpick, this is more accurately called a directed acyclic graph
shell, or simply a DAG shell. The language doesn't seem to allow cycles. dagsh
reads nicer than dgsh too.

------
mtrn
I've worked with and looked at a lot of data-processing helpers: tools that
try to help you build data pipelines for the sake of performance,
reproducibility, or simply code uniformity.

What I've found so far: most tools that invent a new language, or try to cram
complex processes into less suited syntactical environments, are not loved
very much.

A few people like XSLT; most seem to dislike it, although it has a nice
functional core hidden under a syntax that seems to come from a time when the
answer to everything was XML. There are big-data orchestration frameworks that
use XML as a configuration language, which can be OK if you have clear
processing steps.

Every time a tool invents a DSL for data processing, I grab my list of ugly
real-world use cases, and most of the tools fail quickly, if not immediately.
That's a pity.

Programming languages can be effective as they are, and given the exceptions
that unclean data brings, you want to have a full programming language at your
disposal anyway.

I'll give dgsh a try. The tool-reuse approach and the UNIX spirit seem nice.
But my initial impression of the "C code metrics" example from the site is
mixed: it reminds me of awk, about which one of its authors said that it's a
beautiful language, but if your programs get longer than a hundred lines, you
might want to switch to something else.

Two libraries that have a great grip on the plumbing aspect of data-processing
systems are Airflow and Luigi. They are Python libraries, and with them you get
a concise syntax and basically all Python libraries, plus any non-Python tool
with a command-line interface, at your fingertips.

I am curious: what kind of process orchestration tools do people use and
recommend?

~~~
dwhitena
Thanks for sharing your experience. I work with Pachyderm, which is an open-
source data pipelining and data versioning framework. Some things that might
be relevant to this conversation are the fact that Pachyderm is language
agnostic and that it keeps analyses in sync with data (because pipelines
trigger off of commits to versioned data). This makes it distinct from Airflow
or Luigi, for example.

~~~
samuell
Pachyderm, with its "git for big data" approach, is one of, if not THE,
coolest things I learned about in 2016.

I only hope to get time to test it out in more depth sooner rather than later
(it is one of my top goals for 2017).

Also, the pipeline feature in Pachyderm does not suffer from the "dependencies
between tasks rather than data" problem that I mentioned in another post here,
but properly identifies separate inputs and outputs declaratively.

Pachyderm specifies workflows in a kind of DSL AFAIK, and I'm very much
interested to see if it could natively fit the bill for our complex workflows.
But if not, I think we can always use it in a light-weight way to fire off
scipipe workflows (instead of the applications directly), and so let scipipe
take care of the complex data wiring.

We would still like to benefit from the seemingly groundbreaking "git for big
data" paradigm, and from workflows auto-executed on updated data, which should
enable something as impactful as on-line data analysis (auto-updated upon new
data) in a manageable way.

------
karlmdavis
This is perhaps a bit off-topic, but what I really wish more data
processing/ETL tools supported is the concept of transactional units. Too many
of them seem to start with the worldview that "we need to shove in as many of
the separate bits as we possibly can."

What's often needed for robust systems, instead, is solid support for error
handling such that "if this bit doesn't make it in, then neither does that
bit." Data is always messy and dirty, and too many ETL systems don't seem
architected to cope with that reality.
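
As a crude sketch of what I mean in plain shell (the loader steps and file
names here are made up): everything lands in a staging area first, and a
single atomic rename publishes the batch, so a failure anywhere leaves nothing
half-loaded.

    set -euo pipefail        # abort the whole batch on the first failure

    stage=$(mktemp -d staging.XXXXXX)

    # Hypothetical transform steps; any non-zero exit aborts everything,
    # and the half-filled staging directory is simply never published.
    transform_orders  < orders.csv  > "$stage/orders.out"
    transform_refunds < refunds.csv > "$stage/refunds.out"

    # Publish atomically: either the whole batch appears, or none of it.
    mv "$stage" "batch-$(date +%Y%m%d)"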

Of course, maybe I just haven't found the right tools. Anyone know of tools
that handle this particularly well?

------
visarga
I write complex shell commands every day, but when one gets longer than 2-3
rows I switch to a text editor and write it in Perl instead. I see no need to
push bash to that level of complexity; it doesn't look good in the terminal.

A poor man's version of multiple pipes is to write intermediate results into
files, then "cat" the files as many times as needed for the following
processes. I use short file names such as "o1", "o2" (standing for output-1,
output-2) and treat them as temp variables.
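
Roughly like this (the commands and file names are just placeholders):

    # Stage 1: compute once and store the result as a "temp variable".
    sort input.txt > o1

    # Reuse o1 as many times as needed by the following processes.
    cat o1 | uniq -c | sort -rn > o2    # frequency table
    cat o1 | wc -l > o3                 # line count

    # Combine the intermediate results at the end.
    cat o2 o3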

~~~
vinceguidry
This is what it comes down to for me too. Using the shell to do programming
seems to me like putting your job on hard mode.

When I had to do a lot of data processing at my last job, I started building
up tools in Ruby. If I had time, I'd hack the workflow so that the next time I
needed it, I could just run the tool from the command line.

Eventually I had a pluggable architecture that I could use to pull data from
any number of sources and mix it with any other data. Do that with a shell?
Why?

~~~
DSpinellis
The advantage of using the shell is the hundreds of powerful command-line
tools you can use. Increasingly, there are Perl/Python/Ruby packages that
offer similar functionality, but these require some ceremony to use and
therefore hinder rapid prototyping and experimentation.
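
For example, a throwaway question such as "what are the most common words in
this file?" is a one-liner built from standard tools (the file name here is
arbitrary):

    # Ten most frequent words in a text file, composed from stock tools.
    tr -cs '[:alpha:]' '\n' < report.txt |
    tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -rn | head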

------
db48x
Funny, just two or three weeks ago I was saying that I really needed a DAG of
pipes in a shell script that I was writing...

------
tingletech
Interesting, this seems to be from a couple of people at the Information
Systems Technology Laboratory (ISTLab) at the Athens University of Economics
and Business. I wonder what the motivation is: security, or does it utilize
multiple processor cores better than traditional pipes?

~~~
ufo
The impression I got is that it is still using traditional unix tools and
pipes under the hood so I would expect the same efficiency as now. I think the
big difference here is the syntax. Traditional shells are great if you have a
linear dataflow where each program has one standard input and one standard
output. However, if you want to have programs receiving multiple inputs from
pipes or writing to multiple pipes then the `|` syntax is not enough.
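
To illustrate (with made-up commands): | handles the linear case fine, but as
soon as one producer needs to feed two consumers you are already outside what
the | grammar can express.

    # Linear dataflow: one stdin, one stdout per stage -- | is enough.
    grep ERROR app.log | sort | uniq -c

    # One producer, two consumers: plain | has no syntax for this, so
    # bash needs tee plus process substitution, and merging the two
    # results afterwards is harder still.
    grep ERROR app.log |
        tee >(wc -l > error.count) >(sort -u > error.unique) > /dev/null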

------
mtdewcmu
This looks like potentially a great tool. It might be helpful if the author
showed the code examples alongside the equivalent code in bash, so it's easy
to see both what the example code is doing and how much effort is saved by
doing it in dgsh.

~~~
nerdponx
It doesn't look all that different to me. Seems like it's just saving you from
messing around with assigning function inputs and outputs to shell variables.
Otherwise it just looks like piping stuff around between functions.

~~~
DSpinellis
You can write many of the examples we provide in bash, using tee and the
>(process) syntax where data is piped into multipipe blocks. To collect the
data coming out of multipipe blocks you need to construct named pipes and use
them in exactly the right order. It quickly gets complicated and ugly. This is
our fourth stab at the problem; the earlier attempts generated bash scripts,
which looked awful and were unreliable.
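
Roughly what a small fan-out/fan-in looks like in plain bash (a sketch with
placeholder commands, not one of the published examples):

    # Named pipes collect the outputs of the two branches for paste.
    mkfifo p1 p2

    # Fan out: tee feeds both branches through process substitution,
    # and each branch writes its result to its own FIFO.
    ls -l | tee >(awk '{s += $5} END {print s}' > p1) \
                >(wc -l > p2) > /dev/null &

    # Fan in: paste must open and read the FIFOs in the right order;
    # with more branches and more data this is where things deadlock.
    paste p1 p2

    rm p1 p2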

------
be21
I am not familiar with the project. What are the advantages of dgsh in
comparison to pipexec?
[https://github.com/flonatel/pipexec](https://github.com/flonatel/pipexec)

~~~
DSpinellis
Pipexec offers a versatile pipeline construction syntax, where you specify the
topology of arbitrary graphs through the numbering of pipe descriptors. Dgsh
offers a declarative directed graph construction syntax and automatically
connects the parts for you. Also dgsh comes with familiar tools (tee, cat,
paste, grep, sort) written to support the creation of such graphs.

------
CDokolas
Author's page:
[http://www.dmst.aueb.gr/dds/index.en.html](http://www.dmst.aueb.gr/dds/index.en.html)

------
haddr
I wonder if there is any performance benchmark of this graph shell, especially
on complex pipelines processing huge datasets?

~~~
DSpinellis
We have measured many of the examples against versions that use temporary
files, and the web-report one against (single-threaded) implementations in
Perl and Java. In almost all cases dgsh takes less wall-clock time, but it
often consumes more CPU resources.

------
nerdponx
Fun fact: "dgsh" is also the name of a CLI tool for DMs to manage RPG
campaigns: [http://dgsh.sourceforge.net/](http://dgsh.sourceforge.net/)

