

Show HN: Helix I/O – Ultrafast genomic search - nkrumm
https://demo.helix.io

======
vsbuffalo
This looks neat, but is the method open? Is there a command line version?

I'd argue that while open source is a desirable feature in most software, it's
a _necessary_ feature in scientific software. When our results depend on
running through software, that software can't be a proprietary, closed-source
black box.

~~~
nkrumm
We're working on polishing a command-line version of the tool, amongst other
features. But we'd love to get in touch with users/researchers looking to run
some large scale projects (our intent is to have an open, free-for-academic-
use product).

------
nkrumm
OP here. We've been working on faster bioinformatics algorithms for the past
year – this is a demo of one application which is effectively a fast
replacement for BLAST. Happy to answer any questions about tech or usage.

~~~
kevinwuhoo
This is really cool. I'd really love to set this up for use in the lab I work
with. Are you guys planning on releasing a binary or web interface where I can
upload my own *omic data for searching?

~~~
sakai
Hope it was clear that you can test it out with some small FASTA or FASTQ
files now (just drag onto the upload icon).

And yes, we'd be happy to discuss your use case and get you set up with
something. Just shoot me a note at nick at helix dot io and we can follow up
off thread.

------
aroch
We help use the NMPDR RAST servers and also run our own for sequence
annotation. I'd have to check but if I'm remembering correctly our annotations
fix errors (~5%) present in the NCBI databases and catch many more
(something like ~20% of unknowns) for the organisms we work with. Problem is
searching our indexes is super slow (comparatively) and not all in one place
-- BLASTn/p takes some fiddling.

Could we use our own corpus for this?

Ah, I see you have some problems with species identification...are you using
old NCBI datasets?

(ps. If you'd like to talk, you can email me at the address in my profile)

~~~
sakai
Yes, we're building out the features to make it easy to use with your own
corpus. I'll follow up by email, but can you clarify what species
identification issues you're seeing? (Would like to double check and get it
fixed asap if there's a problem).

Thx!

~~~
aroch
It's a problem with NCBI, they have a bunch of H. haemolyticus and Hflu typed
incorrectly. In fairness, I don't think NCBI is really on top of fixing those.

------
Jemaclus
This is very cool. I love data, all kinds of data. The more data we have
available, the better.

But... what's the use case for this? Are there people with genomic sequences
just lying around with no idea of what they represent? Or is it a resource
for students?

(Serious question.)

~~~
cfontes
If I got the tool right, one use case would be similarity search.

So you just finished the brand new genome sequencing of Sugarcane for example.

And you are interested in aluminium resistance genes to be able to plant
sugarcane in aluminium-packed soils that would normally kill the plant.

What you do is throw parts of the genome into a tool like that (there are
others) and see if you can find a match for a sorghum (wheat, corn...) aluminium
resistance gene that has already been researched and tested. The more closely
related, the better.

Now you have some clue about where to start your studies or which genes you
should be looking for in your new varieties or which gene to try to insert in
your next Agrobacterium or gene bombardment test.

Cheers.

PS: I left the Bioinformatics field a few years ago (best job I ever had BTW)
So correct me if I am wrong.

~~~
sakai
cfontes – Yes, that's definitely another use case of the underlying
technology, though please note that the database loaded for the demo
includes only archaeal, bacterial and viral genomes (so you'd have a very hard
time getting a hit on sugarcane sequences!).

(This is nkrumm's and my project)

~~~
cfontes
Nice...

Thanks for the explanation and good luck.

------
amuresan
Hi and congratulations, this looks like an excellent tool for bioinformatics.
Just out of curiosity, how can you mathematically guarantee constant time
searches independent of the database size?

~~~
sakai
That was an attempt to say (without resorting to Big Oh notation) that the
data structures and algorithms we're working on provide O(1) lookups that take
a fixed amount of time regardless of the reference / dataset size. Does that
make more sense?

~~~
amuresan
I understand what it means, but what I don't understand, and was curious
about, is how you can achieve constant-time searching of the dataset. I
started with the assumption that you're doing approximate matches. Is this
correct?

~~~
sakai
Ah – well to clarify, we're performing exact match lookups on individual
k-mers (k-length genetic strings), and then computing a classification result
from those exact searches.
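
To make the idea above concrete, here is a minimal Python sketch of exact
k-mer lookups followed by a simple vote. It is only an illustration, not
Helix's actual implementation: the function names, the tiny k, the plain
dict index and the majority-vote rule are all assumptions made for the
example.

```python
from collections import Counter

K = 4  # hypothetical k-mer length, chosen small for readability


def kmers(seq, k=K):
    """Yield every overlapping k-length substring of seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k=K):
    """Map each k-mer to the set of reference labels that contain it."""
    index = {}
    for label, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(label)
    return index


def classify(read, index, k=K):
    """Count exact k-mer hits per label; return (best label or None, votes)."""
    votes = Counter()
    for kmer in kmers(read, k):
        # Dict lookup is O(1) on average, whatever the size of the index.
        for label in index.get(kmer, ()):
            votes[label] += 1
    if not votes:
        return None, votes
    return votes.most_common(1)[0][0], votes
```

With a hash-based index like this, each k-mer lookup takes roughly the same
time no matter how many references were indexed, so the cost of classifying
a read depends mainly on the number of k-mers in the read (and on how many
labels share each k-mer).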

~~~
amuresan
Thank you!

------
macarthy12
What stack are you using? Are you using anything like Elasticsearch or NoSQL
DBs?

------
elwell
Would be a lot more fun if it had eukaryotes.

~~~
nkrumm
We agree! We're working on improving the scope of the reference database. Out
of curiosity, what kind of use case do you envision for eukaryotes? (Amoeba,
etc?)

