
Choosing a Language for Bioinformatics - totalperspectiv
http://ducktape.blot.im/choosing-a-language-for-bioinformatics
======
kaushalmodi
Cool write-up! I am a huge fan of Nim. I use it as a replacement for small
scripts, big scripts, and even real code at work, including to interface with
other languages that also interface with C.

I see in your post that you are about to look into Nim/C interop; have a look
at nimterop[1]. It has been immensely useful to me for wrapping Nim libraries
around C code exported from Matlab and for talking to the C API for
SystemVerilog.

Finally, your blog post shows a snippet of Python code, but please also show
the world how the Nim code looks [I know that you are sharing the link to the
git repo containing the Nim code, but still] :)

[1]: https://github.com/nimterop/nimterop

~~~
totalperspectiv
Thanks for the link! I will certainly be checking that out! I'm sure there
will be a follow-up with Nim code front and center :)

------
sebhtml
big.tsv is basically the same line repeated 2 000 000 times:

    
    
      $ awk \
     'BEGIN{for (i=0; i<2000000; i++){print "abcdef\tghijk\tlmnop\tqrstuv\twxyz1234\tABCDEF\tHIJK\tLMNOP\tQRSTUV\tWXYZ123"}}' \
       > big.tsv
    
      $ cat big.tsv |sort|uniq -c
      2000000 abcdef ghijk lmnop qrstuv wxyz1234 ABCDEF HIJK LMNOP QRSTUV WXYZ123
    

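You can first sort all the lines and collapse the duplicates with uniq -c. The
first column of each output line is then the number of instances of that line.

The idea is that if 2 lines are identical, the number of "count++" emitted
will be exactly the same for both, so each distinct line only needs to be
scanned once and its result weighted by how often the line occurs.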
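For anyone who reads Python more easily than awk, here is a rough sketch of the same trick. The function names are made up for illustration, the matching rule is my reading of the gawk one-liner (first three characters of a field, lowercased, contain "bc"), and the sample is 1000 copies of the big.tsv line rather than the full file.

```python
from collections import Counter


def count_matches(line: str) -> int:
    """Count tab-separated fields whose first three characters,
    lowercased, contain "bc" (assumed to mirror the gawk test)."""
    return sum("bc" in field[:3].lower() for field in line.split("\t"))


def weighted_total(lines) -> int:
    """Scan each distinct line once and weight it by its multiplicity,
    which is what sort | uniq -c followed by count += $1 does."""
    multiplicity = Counter(lines)
    return sum(count_matches(line) * n for line, n in multiplicity.items())


# Small stand-in for big.tsv: the same line repeated 1000 times.
line = "abcdef\tghijk\tlmnop\tqrstuv\twxyz1234\tABCDEF\tHIJK\tLMNOP\tQRSTUV\tWXYZ123"
lines = [line] * 1000

naive = sum(count_matches(l) for l in lines)  # scans every line
fast = weighted_total(lines)                  # scans one distinct line
print(naive, fast)
```

Both numbers come out the same; the gain depends entirely on how many duplicate lines the input has, so on a file with mostly unique lines the extra counting step buys nothing.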

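I took your gawk code, but I am starting the for loop at 2 instead of 1,
because field 1 is now the count written by uniq -c, and I add that count
instead of doing count++. I also removed the -F'\t' option so that gawk splits
on any whitespace and picks up the count as $1.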

    
    
      $ time cat big.tsv |sort | uniq -c \
      | gawk '{for (i=2; i <= NF; i++) {if (index(tolower(substr($i, 1, 3)), "bc") != 0) {count += $1}}}END{print count}'
      4000000
    
      real 0m0.724s
      user 0m0.605s
      sys 0m0.322s
    
    

Edit: added backslashes to split lines.

