Choosing a Language for Bioinformatics (blot.im)
5 points by totalperspectiv on May 6, 2019 | 3 comments

Cool write-up! I am a huge fan of Nim. I use it at work in place of small scripts, big scripts, and even real code, and also to interface with other languages that themselves interface with C.

I see in your post that you are about to look into Nim/C interop; have a look at nimterop[1]. It has been immensely useful to me for wrapping Nim libraries around C code exported from Matlab and for talking to the C API for SystemVerilog.

Finally, your blog post shows a snippet of Python code, but please also show the world how the Nim code looks [I know that you are sharing the link to the git repo containing the Nim code, but still] :)

[1]: https://github.com/nimterop/nimterop

Thanks for the link! I will certainly be checking that out! I'm sure there will be a follow-up with Nim code front and center :)

big.tsv is basically the same line repeated 2 000 000 times:

  $ awk \
    'BEGIN{for (i=0; i<2000000; i++){print "abcdef\tghijk\tlmnop\tqrstuv\twxyz1234\tABCDEF\tHIJK\tLMNOP\tQRSTUV\tWXYZ123"}}' \
    > big.tsv

  $ cat big.tsv |sort|uniq -c
  2000000 abcdef ghijk lmnop qrstuv wxyz1234 ABCDEF HIJK LMNOP QRSTUV WXYZ123

You can first sort all the lines and then count the duplicates with uniq -c. The first column is now the number of instances of that line.

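The idea is that if two lines are identical, they trigger exactly the same number of "count++" increments, so each distinct line only needs to be scanned once and its result weighted by the number of instances.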

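I took your gawk code, but I start the for loop at 2 instead of 1, because field 1 is now the count added by uniq -c, and I removed the -F'\t' option so that gawk's default whitespace splitting separates that count from the rest of the line.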

  $ time cat big.tsv |sort | uniq -c \
  | gawk '{for (i=2; i <= NF; i++) {if (index(tolower(substr($i, 1, 3)), "bc") != 0) {count += $1}}}END{print count}'

  real 0m0.724s
  user 0m0.605s
  sys 0m0.322s
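
Here is a minimal sketch of why the two counts agree, using a hypothetical three-line sample instead of big.tsv. The per-line command is only my assumption of what the original gawk code looks like (loop from field 1, -F'\t', count++); the deduplicated command mirrors the pipeline above. I use plain awk here, and the temporary file and variable names are made up for the illustration. Note that default whitespace splitting only matches tab splitting when no field contains a space.

```shell
#!/bin/sh
# Hypothetical sample: one line appears twice, another once.
tmp=$(mktemp)
printf 'abcdef\tghijk\n'  > "$tmp"
printf 'abcdef\tghijk\n' >> "$tmp"
printf 'xbcxyz\tlmnop\n' >> "$tmp"

# Per-line version (assumed shape of the original): scan every field of every line.
naive=$(awk -F'\t' '{for (i = 1; i <= NF; i++) if (index(tolower(substr($i, 1, 3)), "bc") != 0) count++} END {print count + 0}' "$tmp")

# Deduplicated version: field 1 is the count written by uniq -c,
# so start at field 2 and add that count once per matching field.
weighted=$(sort "$tmp" | uniq -c | awk '{for (i = 2; i <= NF; i++) if (index(tolower(substr($i, 1, 3)), "bc") != 0) count += $1} END {print count + 0}')

echo "naive=$naive weighted=$weighted"
rm -f "$tmp"
```

Both variables should hold the same number, since each distinct line is scanned once and weighted by how often it occurs. The speedup depends entirely on how many duplicate lines the input has; on a file with mostly unique lines, the sort would likely cost more than it saves.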

Edit: added backslashes to split lines.
