Show HN: Helix I/O – Ultrafast genomic search (helix.io)
27 points by nkrumm on May 12, 2014 | 24 comments

This looks neat, but is the method open? Is there a command line version?

I'd argue that while open source is a desirable feature in most software, it's a necessary feature in scientific software. When our results depend on running data through software, that software can't be a proprietary, closed-source black box.

We're working on polishing a command-line version of the tool, amongst other features. But we'd love to get in touch with users/researchers looking to run some large scale projects (our intent is to have an open, free-for-academic-use product).

OP here. We've been working on faster bioinformatics algorithms for the past year – this is a demo of one application which is effectively a fast replacement for BLAST. Happy to answer any questions about tech or usage.

This is really cool. I'd really love to set this up for use in the lab I work with. Are you guys planning on releasing a binary or web interface where I can upload my own *omic data for searching?

Hope it was clear that you can test it out with some small FASTA or FASTQ files now (just drag onto the upload icon).

And yes, we'd be happy to discuss your use case and get you set up with something. Just shoot me a note at nick at helix dot io and we can follow up off thread.

Once upon a time, I briefly looked at gene matching algorithms like FASTA. n00b question: can these algorithms be generalized to any code (not just genetic codes)? For example, to match Java bytecode or x86 opcodes.

In principle, yes, some of them can. Though genetic reference datasets tend to be much larger than, e.g., codebases -- and therefore require different approaches.

One of the core things we're working on is a data structure for more efficient, constant-time sequence lookups, and we've definitely thought of some non-bioinformatics use cases as well. Happy to answer any more specific questions by email too (in my profile)!

Wouldn't some of the new time-series motif searching approaches work in this domain (you could treat each codon as a quantized enumerated value)? Something like iSAX (http://www.cs.ucr.edu/~eamonn/iSAX/iSAX.html) or Dynamic Time Warping (for motifs that are noisy) - http://www.cs.ucr.edu/~eamonn/SIGKDD_trillion.pdf
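For anyone unfamiliar with DTW: it's a dynamic-programming alignment that tolerates local stretching and compression between two series. A minimal sketch (illustrative only -- this treats codons as hypothetical quantized integer values, as suggested above, and is not anyone's production method):

```python
# Minimal dynamic time warping (DTW) distance -- a toy illustration of
# the motif-matching idea, with codons assumed to be quantized integers.

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with unit steps."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

print(dtw_distance([1, 2, 3, 4], [1, 2, 3, 4]))     # 0.0
print(dtw_distance([1, 2, 3, 4], [1, 2, 2, 3, 4]))  # 0.0 -- warping absorbs the repeat
```

Naive DTW is quadratic per comparison, which is exactly why the "trillion subsequences" paper linked above spends so much effort on lower-bound pruning.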

We use the NMPDR RAST servers and also run our own for sequence annotation. I'd have to check, but if I'm remembering correctly our annotations fix errors (~5%) present in the NCBI databases and catch many more (something like ~20% of unknowns) for the organisms we work with. The problem is that searching our indexes is super slow (comparatively) and not all in one place -- BLASTn/p takes some fiddling.

Could we use our own corpus for this?

Ah, I see you have some problems with species identification...are you using old NCBI datasets?

(ps. If you'd like to talk, you can email me at the address in my profile)

Yes, we're building out the features to make it easy to use with your own corpus. I'll follow up by email, but can you clarify what species identification issues you're seeing? (Would like to double check and get it fixed asap if there's a problem).


It's a problem with NCBI: they have a bunch of H. haemolyticus and Hflu typed incorrectly. In fairness, I don't think NCBI is really on top of fixing those.

This is very cool. I love data, all kinds of data. The more data we have available, the better.

But... what's the use case for this? Are there people with genomic sequences just laying around with no idea of what they represent? Or is it a resource for students?

(Serious question.)

If I got the tool right, one use case would be similarity search.

Say you just finished sequencing the brand-new sugarcane genome, for example.

And you are interested in aluminium resistance genes, to be able to plant sugarcane in aluminium-rich soils that would normally kill the plant.

What you do is throw parts of the genome into a tool like this (there are others) and see if you can find a match for a sorghum (wheat, corn, ...) aluminium resistance gene that has already been researched and tested. The more closely related, the better.

Now you have some clue about where to start your studies or which genes you should be looking for in your new varieties or which gene to try to insert in your next Agrobacterium or gene bombardment test.


PS: I left the bioinformatics field a few years ago (best job I ever had, BTW), so correct me if I'm wrong.

cfontes – Yes, that's definitely another use case of the underlying technology, though please note that the database loaded for the demo only includes archaeal, bacterial and viral genomes (so you'd have a very hard time getting a hit on sugarcane sequences!).

(This is nkrumm's and my project)


Thanks for the explanation and good luck.

Good question. We're targeting analysis of "next generation" DNA sequencing data, as is commonly applied in research settings today. Clinicians and researchers are beginning to use sequencing to try to identify bacteria, viruses or other organisms in complex samples (either from patients or from environmental samples).

These complex “metagenomic” samples can be processed using our tool (we've set up a demo on the submitted link), in order to understand a sample's composition, or to see if there are relevant pathogens in the sample.

Hi and congratulations, this looks like an excellent tool for bioinformatics. Just out of curiosity, how can you mathematically guarantee constant time searches independent of the database size?

That was an attempt to say (without resorting to Big-O notation) that the data structures and algorithms we're working on provide O(1) lookups that take a fixed amount of time regardless of the reference / dataset size. Does that make more sense?

I understand what it means, but what I don't understand, and was curious about, is how you can achieve constant-time searching of the dataset. I started with the assumption that you're doing approximate matches; is this correct?

Ah – well to clarify, we're performing exact match lookups on individual k-mers (k-length genetic strings), and then computing a classification result from those exact searches.
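The idea sketched above (exact k-mer lookups feeding a classification vote) can be illustrated in a few lines. This is a toy with made-up reference names, not Helix's actual data structure; a plain hash table stands in for whatever index they use, giving expected O(1) time per k-mer lookup:

```python
# Toy k-mer classification sketch -- illustrative only. A dict maps each
# k-mer to the references containing it; a read is classified by a simple
# majority vote over its exact k-mer hits.

K = 4  # k-mer length (real tools typically use a much larger k, e.g. 31)

def kmers(seq, k=K):
    """All overlapping length-k substrings of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_index(references):
    """Map each k-mer to the set of reference names containing it."""
    index = {}
    for name, seq in references.items():
        for km in kmers(seq):
            index.setdefault(km, set()).add(name)
    return index

def classify(read, index):
    """Count exact k-mer hits per reference; return the top-voted one."""
    votes = {}
    for km in kmers(read):
        for name in index.get(km, ()):   # exact-match, expected O(1)
            votes[name] = votes.get(name, 0) + 1
    return max(votes, key=votes.get) if votes else None

refs = {"phageA": "ACGTACGTGGA", "phageB": "TTTTCCCCGGA"}
idx = build_index(refs)
print(classify("ACGTACGT", idx))  # phageA
```

Note that per-lookup cost is constant, but classifying a read still scales with the number of k-mers in the read; the index itself scales with the reference size in memory, not in lookup time.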

Thank you!

What stack are you using? Are you using anything like Elasticsearch or NoSQL DBs?

Would be a lot more fun if it had eukaryotes.

We agree! We're working on improving the scope of the reference database. Out of curiosity, what kind of use case do you envision for eukaryotes? (Amoeba, etc?)
