
Seq – A Language for Bioinformatics - lelf
https://seq-lang.org/
======
smabie
As a former bioinformatician (if that’s a word), I’m not sure there’s much
value in this. There’s high dispersion in the performance requirements of
bioinformatics tools. The processes that need to be fast (alignment, BLAST,
whatever, tree creation, etc) are already super fucking optimized (though,
unfortunately, still slow). The things that don’t need to be fast can use
whatever you want (I used Haskell and Racket for my own tools at the time).
Python is.. not the greatest. The major value add is the multitude of
scientific libraries. If you’re gonna throw that all away, why not just use
something better? Things like Julia, OCaml, Haskell, etc. I personally think
Julia is pretty dope and is what I would use today for bioinformatics
research. Or maybe if I was feeling a little subversive, K/Q or J. Q’s time-
series database KDB+ could probably be used for sequences. And maybe even for
great effect. And the performance would be off the charts!

It seems like the purpose of this is Python without the performance penalty,
which doesn’t make much sense to me. I’ve found Haskell absolutely perfect for
bioinformatics as most operations you are doing are functional data
transformations. Moreover, it’s pretty damn fast if you need it to be.

I’ve been out of the field a long time though (Roche-454 was still the main
workhorse at the time). But let me tell you, bioinformatics is/was a fucking
shit-show. The tools and ecosystem are/were like Linux in the mid 90s: fucking
terrible. And another language is just gonna make it worse.

------
inumanag
Hi guys,

one of the authors (Ibrahim) here! Thanks a lot for the comments--- we
definitely appreciate them!

A quick explanation why we built Seq:

\- We were not happy with the existing bioinformatics libraries for various
reasons. And honestly, while Julia is amazing project (and we do talk to the
Julia team from time to time as they are located two floors above our office
at MIT), it never 'clicked' with us or many other people in the field.

\- While the main application domain is bioinformatics (that's where we came
from), Seq is pretty much a strongly typed statically compiled Python. One of
the main goals we had was to push the boundaries of how much stuff in Python
can be deduced by compiler, as we loved the Python's syntax (Seq is to Python
as Cystal is to Ruby--- or at least that is what we are aiming for).

\- We do not want people to learn a new language--- Seq should be pretty much
a drop-in replacement for Python, at least for most scientific/bioinformatics
software. There still remains a small gap, but we are actively working to
close it.

\- At some level, Python libraries cannot cut it, especially when dealing with
next-gen sequencing data. Also, owning the whole stack gives us the control to
perform low-level pipeline optimizations. Chief example is out prefetch
statement that is rather hard to implement in other languages.

Also, check out the paper
([https://dl.acm.org/doi/10.1145/3360551](https://dl.acm.org/doi/10.1145/3360551))
for more information. Let me know if you cannot access it for various reasons.

~~~
dalke
What does "first language tailored specifically to bioinformatics" mean?

I ask because I'm pretty sure there have been earlier languages for
bioinformatics. Just looking around I see:

\- "Composable languages for bioinformatics: the NYoSh experiment",
[https://peerj.com/articles/241/](https://peerj.com/articles/241/)

> In order to evaluate the potential of language workbenches in
> bioinformatics, we tested a prominent workbench by developing an alternative
> to shell scripting. To illustrate what LWBs and Language Composition can
> bring to bioinformatics, we report on our design and development of NYoSh
> (Not Your ordinary Shell). NYoSh was implemented as a collection of
> languages that can be composed to write programs as expressive and concise
> as shell scripts.

\- "Large-scale data integration framework provides a comprehensive view on
glioblastoma multiforme",
[https://genomemedicine.biomedcentral.com/articles/10.1186/gm...](https://genomemedicine.biomedcentral.com/articles/10.1186/gm186)

> Workflows are constructed using a custom workflow language called
> AndurilScript that resembles traditional programming languages and is
> designed to enable rapid construction of complex workflows. ... These
> characteristics of Anduril facilitate scientists without bioinformatics
> training to interpret complex data sets, such as TCGA.

\- "SARVAVID: A Domain Specific Language for DevelopingScalable Computational
Genomics Applications",
[https://engineering.purdue.edu/dcsl/publications/papers/2016...](https://engineering.purdue.edu/dcsl/publications/papers/2016/final_sarvavid_ics2016_submitted.pdf)

> In this paper, we have presented Sarvavid and showed implementations of five
> popular genomics applications, BLAST, MUMmer, E-MEM, SPAdes and SGA in it.

I see the last includes K-merization, like Seq does.

~~~
mhenders
Love your bio, and those are some cool links.

To your first question, seeing that in context of their paper abstract, I
think the emphasis should be on the last part of the sentence, "...which
marries the ease and productivity of Python with C-like performance."

~~~
dalke
There's a comma before that clause, which suggests that it's not meant to be
interpreted as "the first bioinformatics-specific programming language to
marry C and Python" but rather as "the first bioinformatics-specific
programming language (oh, and it marries C and Python)."

Which bio? My most recently updated one is
[http://mmpdb.dalkescientific.com/about.html](http://mmpdb.dalkescientific.com/about.html)
.

~~~
mhenders
[http://www.dalkescientific.com/company.html](http://www.dalkescientific.com/company.html)

Boy, what a body of work. Would love to hear your take on Seq if you get a
chance to dig in more. I come from more of a Python background and really like
what I see:

    
    
      - Python syntax for the most part
      - Mypy style type annotations
      - |> for pipelining and ||> to parallelize
      - match constructs (ML-ish with guards)
      - s'ACGT' and k'ACGT' for sequences and k-mers
    

It's relatively early in the project but I think the choices they have made
give them a chance to continue to build momentum.

~~~
dalke
Thanks for the pointer. And praise. I have several bios out there, for
different purposes. I should update that one .... and the web site. Been
meaning to do that for a long time. _sigh_

I don't have the interest to evaluate Seq. I only did bioinformatics for a
couple of years, and that was almost 20 years ago. I am incapable of giving it
a serious look. My cursory look is that it's only really meant for DNA, and
highly optimized for the specific k-mer benchmark compared.

My personal opinion is that developing a new programming language is a hard
and almost thankless task. Yet they are also so very fun to do.

Back in the 1980s and 1990s, a number of computational life sciences packages
(eg, Sybyl Programming Language (SPL) from Tripos, X-PLOR, Scientific Vector
Language (SVL) from CCG) developed their own domain specific languages. They
were successful (IMO) more because of the tools available through them than
the language themselves.

In the early 1990s I started work on VMD. We had a simple scripting language.
As one did back then. I found that Tcl was similar, embeddable, more powerful,
and cleaner than anything we did, so we switched to that.

As a result, we got a lot of things "for free", like easy access to third-
party packages designed for Tcl, and the faster bytecode engine from Tcl 8.

I switched to using Python full-time in the late 1990s, and again saw just how
much I got "for free" by using a language designed by people with software
language design and implementation expertise, and with the vast set of support
packages.

Seq also gets some things "for free" because it builds on LLVM.

But since it's completely independent of the CPython run-time and garbage
collection model, it's impossible to build on any of the existing modules.

------
hoytech
I'm especially interested in how sequence searching and matching work in
libraries like this. Seq has a "match" statement for this task, which
implements ACGT characters and _ for a single wildcard base and "..." for
multiple wildcard bases, and a recursive matching system I haven't quite
grokked yet.

Personally I'm more comfortable with a regular expression syntax, so would
prefer "." and ".*". Actually, even better than "." is "N" from the IUPAC
notation:
[https://en.wikipedia.org/wiki/Nucleic_acid_notation](https://en.wikipedia.org/wiki/Nucleic_acid_notation)

The IUPAC notation is nice because it standardises "character classes" for
working with nucleic acid sequences. For example, "B" is "[CGT]".

I wrote a module a while ago for searching nucleic acid sequences with
regexps:
[https://metacpan.org/pod/Bio::Regexp](https://metacpan.org/pod/Bio::Regexp)

When working with sequences there are a bunch of things to think about that
aren't really obvious from other types of data (or at least weren't obvious to
me!)

Exhaustive search: Often a regexp can match in many ways, and most regexp
systems don't provide a way to get a complete list of them all. Fortunately
there is the Regexp::Exhaustive perl module which is what I used. The way this
module works is pretty awesome. It adds a special "FAIL" directive to the end
of a pattern, so that the match can be recorded and artificially failed,
triggering the regexp engine's backtracking mechanism to back up and find the
next match (if any).

Reverse complements: Because DNA is double-stranded (well, usually... this is
biology after all) there is a "complementary" pattern on the first strand that
corresponds to the pattern you are interested in on the other strand. You
almost always need to search for both. And what's more, DNA is directional so
you actually need to search in the reverse direction for this complementary
pattern. You can either reverse and complement your sequence (which seq has a
special ~ operator for, neat!) and search again, or reverse and complement the
search pattern itself, assemble a single combined regexp (Regexp::Assemble
module), and do a single scan over the data, which is what Bio::Regexp does.

Circular DNA: Some DNA (plasmids) are actually circular in shape, meaning the
start is connected to the end. So a comprehensive search needs to check for
cases where the desired patterns span the arbitrary location selected as the
"start" in your sequence.

~~~
dalke
"Often a regexp can match in many ways ..."

When I worked in bioinformatics, some 20 years ago, I implemented a SWISS-PROT
to regex converter. I ended up having to ask SWISS-PROT was if certain
patterns were meant to be greedy or lazy, since the documentation wasn't
clear. I've since forgotten the answer, but I have a vague memory that they
hadn't really considered that multiple interpretations were possible, so they
probably expected greedy matches.

~~~
jerven
Parsing the flatfile of Swiss-Prot in detail is the best way to lose your
sanity ;) I know because that is part of my day job.

It's gotten better but I highly recommend you either use the RDF or XML. The
flatfile looks easy but is an endless source of bugs and urgent changes.

~~~
dalke
I did say it was 20 years ago. ;) Since then I've been working in
cheminformatics. Which has its own forms of insanity.

~~~
jerven
Can I just say that these day's you can search Swiss-Prot with InchiKey's due
to our integration with rhea
([https://www.uniprot.org/news/2018/12/05/release](https://www.uniprot.org/news/2018/12/05/release))

------
chromatin
Fascinating, but there exist already htslib [0] bindings for Python (and many
other languages). htslib truly is the standard library with respect to high-
throughput sequencing data file access, and with high level bindings, we can
already write something like:

``` for seq in bamfile: print(seq.pos) ``` or whatever.

[0] [https://github.com/samtools/htslib](https://github.com/samtools/htslib)

~~~
snackematician
To expand on this -- I believe pysam [0] is the official htslib interface for
Python.

It's quite good. In addition to providing Python bindings, it also provides
extra functionality from within Cython, if you need additional features/speed
but don't want to drop all the way down to C.

[0]
[https://pysam.readthedocs.io/en/latest/](https://pysam.readthedocs.io/en/latest/)

~~~
mbreese
Is the pysam API more stable now? Years ago, I had to stop using it because
the API has changed between versions too much.

------
xvilka
Why not just contribute to BioJulia[1][2] instead? It is a more mainstream
language but the one that fits computational tasks better than Python.

[1] [https://biojulia.net/](https://biojulia.net/)

[2] [https://github.com/BioJulia](https://github.com/BioJulia)

~~~
brofallon
As a full-time bioinformatician I'm totally excited about Julia for
bioinformatics. Currently BioJulia needs some pretty serious work before it
can be used for routine bioinformatics tasks, for instance there's no good
support for the CRAM format. Nonetheless if some of this basic functionality
were implemented I wouldn't be surprised to see much greater adoption

------
neeeeees
Like a few others have mentioned, I do wonder how big the marginal return in
baking in these features is.

From a quick look, stuff like FASTA/BAM parsing, translation, etc can be
implemented in C-land a la numpy, and called from Python, right?

A language like Swift would also support the addition of powerful user-defined
operators, as in the case of Swift for TensorFlow[1].

Language adoption is hard to drive, and I wonder if having domain-specific
library calls built-in is worth the added effort for people in the field.

[1] [https://www.tensorflow.org/swift](https://www.tensorflow.org/swift)

~~~
Someone
Nitpick: Swift for TensorFlow currently is a fork of Swift, not a library
implemented in Swift.
[https://github.com/tensorflow/swift](https://github.com/tensorflow/swift):

 _”Compiler and standard library development happens on the tensorflow branch
of the apple /swift repository.

[…]

Swift for TensorFlow is not intended to remain a long-term fork of the
official Swift language. Language additions are designed to fit with the
direction of Swift and will go through the Swift Evolution process.”_

~~~
neeeeees
That is correct, and I should have made that clearer. However, a lot of it’s
features are implemented in Swift (via custom operators, overloading etc) in
addition to compiler intrinsics.

The point I had in mind was that a language like Swift may allow for both
power and DSL-like expressiveness without requiring learning an entirely new
language. In fact, standard arrays are implemented in Swift code
([https://github.com/apple/swift/blob/master/stdlib/public/cor...](https://github.com/apple/swift/blob/master/stdlib/public/core/Array.swift))
yet feel closer to say a Python list than a Java ArrayList from the user’s
perspective.

------
StefanKarpinski
Response from the BioJulia developers: [https://biojulia.net/post/seq-
lang/](https://biojulia.net/post/seq-lang/). High-level points:

* they were able reproduce the performance comparison between Seq and BioJulia

* BioJulia spends most of its time on these benchmarks validating and transcoding data into a more compact, efficient representation of gene sequences

* Seq, on the other hand, operates on the raw ASCII input data and does no validation

* BioJulia devs implemented Julia types that representing gene sequences the same way as Seq does in less than 100 lines of Julia code [1]

* when using the same representation in as Seq, BioJulia was significantly faster than Seq

* BioJulia devs were able to further optimize transcoding of gene sequences to get a 10x performance improvement [2]

* with these improvements BioJulia reaches similar performance to Seq while still doing validation and using less memory

The full post is well worth reading. For me the main takeaway is that there's
no real need for a domain-specific language at least not based on these
results. Julia is already a great language for this kind of work and you get
C-like speed and JIT compilation for free.

[1]
[https://github.com/jakobnissen/SeqLangBenchmarks/blob/master...](https://github.com/jakobnissen/SeqLangBenchmarks/blob/master/SeqJL.jl)

[2]
[https://github.com/BioJulia/BioSequences.jl/issues/86](https://github.com/BioJulia/BioSequences.jl/issues/86)

------
helen___keller
My personal impression isn't that bioinformatics needs a full language, but
more tools in popular environments to lower the entry barrier for good
software engineering practices.

Every bioinformatics codebase I've looked at has been a downright mess.
Basically an ad hoc collection of scripts that transform data this way, maybe
rendering some graphs or such, relying on 100 unstated assumptions. Nothing is
maintainable, and often rely on messy approaches like loading your entire data
set into memory (works fine for your 1-10GB data set, then not so much on a
larger one) or what I would gently describe as mainframe compute in place of
real software engineering.

~~~
downerending
I'm reminded of this rant from Richard O'Keefe (of Prolog fame). It was a
while ago, but little has changed.

"I don't know what art these [bioinformatics] programs are state-of; possibly
macrame. They certainly aren't even 1970's state of the programming art."

"Is it reasonable to expect people with a biochemistry or mathematics
background to write clean well-engineered code? No. For the importance of the
topic, and the sums of money involved, is it reasonable to expect that they'll
have their programs cleaned by someone else before release? I think it is.
With the pervasive lack of quality I'm seeing, I don't trust _any_ of the
results of these programs. I have to wonder how many published results
obtained using these programs (and fed back into databases that are used to
derive more results which are ...) are actually valid."

[http://catless.ncl.ac.uk/m/risks/21/98#subj8.1](http://catless.ncl.ac.uk/m/risks/21/98#subj8.1)

------
gandalfgeek
I made a short explainer video on the language here:

[https://youtu.be/5bk4Wc5Op2M](https://youtu.be/5bk4Wc5Op2M)

~~~
truculent
this was great - thank you

------
parsimo2010
This should be marketed as a language for _genomics_ not bioinformatics. There
is more to bioinformatics than just genomics, but this language (at least the
beginning of the docs that I looked at) seems to be marketing exclusively for
genomics analysis.

Not a knock on the project at all, but it doesn't seem like someone doing
analysis on cell images is going to get much out of this language.

------
dannykwells
As a working biological data scientist... This is useless until a tool I need,
which I can't get anywhere else, is written in it.

My suggestion would be to rewrite some of the most popular tools, like bwa, in
this language, and show the comparative performance etc.

Then, write a comprehensive open source package with great documentation and
maintenance in this language, to demonstrate to others your investment.

Then, maybe, it will get some traction. But honestly, C, R and Java are so
embedded it will be a hard road.

~~~
mhenders
Traction is incredibly difficult, but there seem to be some good signs. I'm
not a working bio data scientist, but I think the project so far has shown
good taste on what to recreate vs. what to make easy to interface with. e.g.
[https://seq-lang.org/tutorial.html?highlight=bwa#calling-bwa...](https://seq-
lang.org/tutorial.html?highlight=bwa#calling-bwa-from-seq)

------
glofish
Since in bioinformatics, some processes are vastly more time-consuming than
others it is not clear what the benefit of designing a new language is as
opposed to adding an optimized, C library into Python itself.

What functionality would not work the same way if this were a python library?

That being said, it is a neat and impressive effort, I do fear though that it
will have a whole lot less uptake as now requires bioinformaticians to learn a
third language (Python, R and now seq)

Historically I note a similarity to the 'Mothur' and 'Qiime' split.

'Mothur' is a dynamic language similar to 'seq' whereas 'Qiime' is Python glue
across many libraries. Frankly, I like 'mothur' better, but 'Qiime' is whole
lot more popular.

------
prirun
I'm a bit surprised at all the negative comments here. I hope it isn't too
discouraging for your team, because as author of a 50K LOC Python app
(HashBackup), I could really use this! I love the Python language but
sometimes the performance is a drag. For example, to plan a restore when the
data isn't local, HashBackup has to traverse every block in every file to be
restored and figure out when to load the block and when it can be released
from the cache. This isn't particularly difficult, but for very large restores
it requires long loops using large lists, arrays, and/or dicts. Parts are
coded in Cython, and that works well for easily-isolated functions, but not so
great for something like the restore plan that needs database access and is
referenced during the restore.

I ran a small 10M entry {int:int} dict benchmark. In Python 2.7, the test used
1.1GB of RAM and about 8 seconds. In D (fully compiled) the same test used
881MB and 7.4 seconds. Here's the D version:

    
    
      import std.stdio : writeln;
    
      int main() {
        int[int] map;
    
        foreach (i; 0..10_000_000)
          map[i] = i;
    
        foreach (i; 0..3) {
           foreach (j; 0..10_000_000) {
             map[j] = map[j] + 1;
           }
        }
        return 0;
      }
    

In Seq it ran in 5 seconds and used 395MB. Here's the test program and Seq
run:

    
    
      map = dict[int,int]()
      for i in range(0,10000000):
          map[i] = i
      for i in range(3):
        for j in range(10000000):
          map[j] = map[j] + 1  
      print map[12345]
    
      [root@hbseq ~]# /usr/bin/time -v ./map
      12348
        User time (seconds): 4.67
        System time (seconds): 0.55
        Percent of CPU this job got: 97%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.36
        Maximum resident set size (kbytes): 395532
    

Looks pretty great to me, especially if I don't have to do a major rewrite! I
guess I could have hit a case where Seq happened to have a higher HT load for
10M entries and D just did a resize, so it would be good to run the same kind
of test at a lot of different hash table sizes. But the Python results are
pretty terrible space-wise.

------
iddan
I don't really understand why to create a new language. But I like the
improvements they made for the Python lang

~~~
AllegedAlec
Because this is the precise situation for which you'd want a domain specific
language. You have a group of people who are not expert programmers, so pure
C/C++ is quite annoying to write, and python's performance on the type of
datasets used by these researchers is abysmally slow.

------
zoom6628
As someone who noodles around in python, and knows nothing about
bioinformatics, this looks very interesting for the 'nice' things it does to
basic python like forcing single type of returns, array controls. Its almost
like a 'safer-python' set of constructs that would be of great use to the
general python programmer. And the pipe operator is very cool.....gotta try
that out soon.

------
Kim_Bruning
Hmm, it's interesting that you have DNA base sequences support built in. But
(you know someone was going to ask ;-) ) I don't see similar support for Amino
Acid sequences, or encoding/decoding between the two. Is this a deliberate
design choice?

~~~
mhenders
I'm not sure this is what you are looking for but there is an example in their
docs of dna-to-protein.

[https://seq-lang.org/cookbook.html?highlight=protein#dna-to-...](https://seq-
lang.org/cookbook.html?highlight=protein#dna-to-protein-translation)

~~~
jerven
Yeah, but not all organisms use the same translation.
[https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/inde...](https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes)

So an unnamed function without argument might easily give the wrong thing.

------
sansnomme
It is like a Julia-Nim hybrid.

------
totalperspectiv
I'm curious how this differentiates from Nim other than the builtin types for
bfx stuff. Still cool either way and great to see something else joining this
space.

------
vmchale
How does it compare to Futhark/J? I'm especially curious how it compares to
calling Futhark from Python. Are there special data structures?

~~~
hyperbovine
It has a better name.

~~~
_hl_
Off-topic, but futhark is such a brilliant and well-executed idea, if only it
had a better name.

~~~
abecedarius
What's wrong with the name?

~~~
vmchale
It's hard to google? That has been my complaint.

------
leafmeal
Heterogeneous collections and no inheritance/polymorphism seems like a bad
combination.

------
Pandaburgers
Is it possible to write a python module with seq?

~~~
pure_simplicity
If it is possible to write python modules in Seq as easily as writing plain
Seq, that would give Seq a great possibility for benefiting from the
popularity of Python.

Interoperability would increase adoption. Library writers and maintainers
would sooner consider switching to Seq for performance reasons, while still
being able to use those libraries from Python.

------
unixhero
This is similar to Crystal?

------
ComputerGuru
Obligatory “this name is already in use in this field” post: Seq is the name
of a very popular structured logging sink that provides a web app with a query
language interface for searching through and graphing log streams, often
paired with Serilog when used in the .NET world.

