This looks neat, but is the method open? Is there a command line version?
I'd argue that while open source is a desirable feature in most software, it's a necessary feature in scientific software. When our results depend on running data through software, that software can't be a proprietary, closed-source black box.
We're working on polishing a command-line version of the tool, amongst other features. But we'd love to get in touch with users/researchers looking to run some large scale projects (our intent is to have an open, free-for-academic-use product).
OP here. We've been working on faster bioinformatics algorithms for the past year – this is a demo of one application which is effectively a fast replacement for BLAST. Happy to answer any questions about tech or usage.
Once upon a time, I briefly looked at gene-matching algorithms like FASTA. n00b question: can these algorithms be generalized to any code (not just genetic code)? For example, to match Java bytecode or x86 opcodes.
In principle, yes, some of them can. Though genetic reference datasets tend to be much larger than, e.g., codebases -- and therefore require different approaches.
One of the core things we're working on is a data structure for more efficient, constant-time sequence lookups, and we've definitely thought of some non-bioinformatics use cases as well. Happy to answer any more specific questions by email too (in my profile)!
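To make the "constant-time sequence lookups" idea concrete: a common way to get average O(1) lookups is to index every k-mer of the reference in a hash table (this is a toy sketch of the general technique, not necessarily what this tool does internally; the k-mer length and sequences are made up):

```python
from collections import defaultdict

K = 8  # k-mer length (illustrative choice)

def build_index(reference: str) -> dict:
    """Map every length-K substring of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - K + 1):
        index[reference[i:i + K]].append(i)
    return index

def lookup(index: dict, kmer: str) -> list:
    """Average O(1) hash lookup -- cost is independent of reference size."""
    return index.get(kmer, [])

ref = "ACGTACGTGGTTACGTACGA"
idx = build_index(ref)
print(lookup(idx, "ACGTACGT"))  # → [0]
```

The index costs memory proportional to the reference, but each query k-mer then resolves in constant time on average, no matter how large the reference grows.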
We use the NMPDR RAST servers and also run our own for sequence annotation. I'd have to check, but if I'm remembering correctly, our annotations fix errors (~5%) present in the NCBI databases and catch many more (something like ~20% of unknowns) for the organisms we work with. The problem is that searching our indexes is comparatively slow and not all in one place -- BLASTn/p takes some fiddling.
Could we use our own corpus for this?
Ah, I see you have some problems with species identification...are you using old NCBI datasets?
(ps. If you'd like to talk, you can email me at the address in my profile)
Yes, we're building out the features to make it easy to use with your own corpus. I'll follow up by email, but can you clarify what species identification issues you're seeing? (Would like to double check and get it fixed asap if there's a problem).
If I got the tool right, one use case would be similarity search.
Say you've just finished the brand-new genome sequencing of sugarcane, for example.
And you're interested in aluminium resistance genes, to be able to plant sugarcane in aluminium-rich soils that would normally kill the plant.
What you do is throw parts of the genome into a tool like this (there are others) and see if you can find a match for a sorghum (or wheat, corn...) aluminium resistance gene that has already been researched and tested. The more closely related, the better.
Now you have some clue about where to start your studies, which genes to look for in your new varieties, or which gene to try to insert in your next Agrobacterium or gene bombardment test.
PS: I left the bioinformatics field a few years ago (best job I ever had, BTW), so correct me if I'm wrong.
cfontes – Yes, that's definitely another use case of the underlying technology, though please note that the database loaded for the demo only includes archaeal, bacterial and viral genomes (so you'd have a very hard time getting a hit on sugarcane sequences!).
Good question. We're targeting analysis of "next generation" DNA sequencing data, as is commonly applied in research settings today. Clinicians and researchers are beginning to use sequencing to try to identify bacteria, viruses or other organisms in complex samples (either from patients or from environmental samples).
These complex “metagenomic” samples can be processed using our tool (we've set up a demo on the submitted link), in order to understand a sample's composition, or to see if there are relevant pathogens in the sample.
That was an attempt to say (without resorting to big-O notation) that the data structures and algorithms we're working on provide O(1) lookups -- i.e., a fixed amount of time regardless of the reference/dataset size. Does that make more sense?
I understand what it means, but what I don't understand, and was curious about, is how you can achieve constant-time searching of the dataset. I started with the assumption that you're doing approximate matches; is this correct?
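(For context: the classic way BLAST-family tools square approximate matching with fast exact lookups is "seed and extend" -- short k-mer seeds hit a hash index in constant time each, and then a mismatch-tolerant extension step around each seed recovers approximate matches. Whether this particular tool works that way, I can't say; this is just a toy sketch of the general pattern with made-up sequences:)

```python
from collections import defaultdict

K = 8  # seed length (illustrative choice)

def build_index(ref: str) -> dict:
    """Hash every K-mer of the reference to its positions."""
    index = defaultdict(list)
    for i in range(len(ref) - K + 1):
        index[ref[i:i + K]].append(i)
    return index

def seed_and_extend(query: str, ref: str, index: dict, max_mismatches: int = 2):
    """Approximate matching: exact O(1) seed lookups, then a mismatch-tolerant check."""
    hits = set()
    for qpos in range(len(query) - K + 1):
        for rpos in index.get(query[qpos:qpos + K], []):  # average O(1) per seed
            start = rpos - qpos                  # where the full query would align
            if start < 0 or start + len(query) > len(ref):
                continue
            mm = sum(q != r for q, r in zip(query, ref[start:start + len(query)]))
            if mm <= max_mismatches:
                hits.add((start, mm))
    return sorted(hits)

ref = "TTTTACGTACGTGGCCAATTTT"
idx = build_index(ref)
# Query matches ref[4:16] with one substitution, yet is still found
# because at least one of its 8-mer seeds matches the reference exactly.
print(seed_and_extend("ACGTACGTGGAC", ref, idx))  # → [(4, 1)]
```

A query with a few mismatches still shares some exact k-mers with the reference, so the constant-time seed lookups find candidate locations, and only those candidates pay the cost of the approximate comparison.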