Sequencing your DNA with a USB dongle and open source code (stackoverflow.blog)
276 points by TangerineDream 30 days ago | 120 comments

Until you also realize you need a Qubit and the library preps and oh now you need NEB next gen enzymes and wow turns out pipette technique really matters.

That said, I love Nanopores, I use them in my business, and those error rates you can hack around if you know what’s going on under the hood.

"wow turns out pipette technique really matters" <- one of the most underrated comments of all time.

Boris Johnson gives a nice demonstration here:


The worst for me was coming in early and setting up gels. I'd drink a bunch of coffee, have shaky hands, and then break the gel with the pipette tip repeatedly while trying to jam the dna into the well.

there's a reason I went into automated biological robots.


Going on the same theme, what's an absolutely terrible example of technique?

Pipette skills improve rapidly if you practice with a microscale.

that's how we calibrated ours. turns out: most pipettes in the lab were miscalibrated, with 50+% error. Then it turned out our scale wasn't properly calibrated, so we had to replace that too.

I spent some more time thinking about this and wrote the following.

Modern scientists move small amounts of biological materials using a tool called the pipette. Pipettes can work with very small amounts of liquid, down to the microliter. When you're running a delicate experiment, being able to deliver the precise amount of liquid is critical.

Pipettes need to be calibrated. How do you calibrate a device that works with volumes of liquid? Volumes of liquid are hard to measure. Fortunately, water at STP (standard temp and pressure) has a known mass, so you attempt to draw 1 mL, and weigh it. 1 mL of water weighs 1 gram at STP (this is not a coincidence- it's by definition).

OK so you're weighing 1 gram of water and adjusting the pipette's calibrator knob so that 1mL on the pipette weighs 1 gram.
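To make the weighing step concrete, here is a minimal Python sketch of a gravimetric check. It is an illustration under stated assumptions, not a lab procedure: the density table holds rounded reference values for pure water (which is slightly under 1 g/mL at room temperature), the function names and the replicate readings are hypothetical, and it ignores air buoyancy and evaporation, which a real calibration would correct for.

```python
import statistics

# Rounded densities of pure water in g/mL, keyed by temperature in Celsius.
WATER_DENSITY_G_PER_ML = {4: 1.0000, 20: 0.9982, 25: 0.9970}


def dispensed_volume_ml(mass_g, temp_c=20):
    """Convert a weighed mass of water into the volume that was dispensed."""
    return mass_g / WATER_DENSITY_G_PER_ML[temp_c]


def percent_error(nominal_ml, mass_g, temp_c=20):
    """Signed error of the dispensed volume relative to the pipette setting."""
    actual_ml = dispensed_volume_ml(mass_g, temp_c)
    return 100.0 * (actual_ml - nominal_ml) / nominal_ml


def summarize(nominal_ml, masses_g, temp_c=20):
    """Return (mean percent error, percent CV) over replicate weighings."""
    volumes = [dispensed_volume_ml(m, temp_c) for m in masses_g]
    mean_ml = statistics.mean(volumes)
    accuracy = 100.0 * (mean_ml - nominal_ml) / nominal_ml
    precision = 100.0 * statistics.stdev(volumes) / mean_ml
    return accuracy, precision


if __name__ == "__main__":
    # Hypothetical replicate readings in grams for a pipette set to 1 mL.
    readings = [0.9979, 0.9985, 0.9981, 0.9990, 0.9975]
    accuracy, precision = summarize(1.0, readings)
    print(f"mean error {accuracy:+.2f}%, CV {precision:.2f}%")
```

The mean error says how far the knob is off (accuracy) and the CV says how repeatable the pipetting is (precision), which is roughly the part that depends on technique.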

I guess that means your weighing scale needs to be calibrated, too. Huh. These sorts of scales aren't just "weigh some flour for baking", either. They have to be accurate to the hundredth of a gram, and have walls to avoid fluctuations due to air currents(!!!!) and minor temp changes. The scales are calibrated using calibrated test weights.

Oh dear. Calibrated test weights? If you follow the turtles all the way down, you find that there is actually a traceability chain from your calibrated scale back to one of the defined weights held by NIST and the NIST equivalents in France and Japan (they all share their weights). So you can actually calculate, using those weird rules of error propagation you forget in high school, the error of your scale by combining the errors in that chain (often, knowing your error bars is more important than knowing the accurate answer).
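As a rough illustration of that error-propagation step: if the links in the chain are independent, the usual rule is to combine their relative standard uncertainties in quadrature (root-sum-of-squares). The chain below is entirely hypothetical; the link names and numbers are made up for the sketch and are not real calibration data.

```python
import math


def combined_relative_uncertainty(link_uncertainties):
    """Root-sum-of-squares of independent relative standard uncertainties."""
    return math.sqrt(sum(u * u for u in link_uncertainties))


# Hypothetical traceability chain; each value is a relative uncertainty.
chain = {
    "national standard": 2e-8,
    "reference lab weights": 5e-7,
    "working test weights": 5e-6,
    "bench scale": 1e-4,
}

total = combined_relative_uncertainty(chain.values())
```

With numbers like these the total is dominated by the worst link, which is usually the instrument sitting on your own bench.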

But that's not all. Those defined weights? They're obsolete. Le Grand K (the origin of the kilogram, still kept under lock and key) changed weight over time due to subtle metallurgical details.

The new definition of the standard is created by an obscure machine at NIST, just like the time standards. https://en.wikipedia.org/wiki/Kibble_balance is the tool used to do it, and it depends on the NIST time reference.

So, turtles all the way down until you get to the cesium fountain.

Exactly. Better analytics can enable this technology to produce better results than competing technologies in less time. Once automated/easy/rapid sample prep comes, there will be mass adoption in the space.

Disclaimer: Co-Founder of BugSeq[0] 0: https://bugseq.com

> Once automated/easy/rapid sample prep comes, there will be mass adoption in the space.

Sounds like Elon calling biology a “software problem”.

Not saying that you’re wrong, just saying that the computational folk tend to discount the challenges and skills required in the wet lab.

Agreed - Definitely a different class of problem than "software". There are large barriers, eg. lab contamination, biocontainment, low input protocols, etc; however, technological innovation will help with these.

That being said, we see a future where someone without advanced molecular training can put a sample (whether that's a nasal swab, concerning white powder received in the mail or lab-grown meat) in a black box and get out a meaningful report.

>> Not saying that you’re wrong, just saying that the computational folk tend to discount the challenges and skills required in the wet lab.

It's time to bring in the industrial automation folks. They probably won't invent a fancy new algorithm to reduce the time to splice the pieces together, but they'll fine tune and automate your reader to the 9's.

Question: a majority of software environments will take on individuals without traditional qualifications, throw a dozen books' exam sections at them, and keep those who can hack it.

I just realized industrial automation sounds really interesting. What would my chances be for someone who never got the chance to study math?

(Basically in 1998 it was illegal to change schools in Australia regardless of how much of an eyebrow-raising situation you might've been in. Had to homeschool, without any resources. Only realized ~20 years on just how much opportunity I'll never get back.)

(Heh, I'm pretty much expecting the only obvious possible answer at this point, I was just curious if the answer is "yeah no" or "it depends".)

Yea automated sample preps are key for me. The main thing that is overlooked in synthetic biology about nanopore is it has the capability to dramatically lower cost of indexing, which turns out to be one of the main prohibiting costs for dropping the cost of plasmid production.

I don't think you need the Qubit with the rapid prep.

it works but your efficiency drops by quite a bit

> those error rates

Do a thousand readings, fix the parts that don't match across the board?

Then you have to do 1000x the sequencing, which can get expensive on long read technologies :)

That said, that’s basically how a lot of NGS works in things like cancer sequencing on Illumina platforms.

> Then you have to do 1000x the sequencing

Seems to me, that stuff is getting cheaper all the time.

RE: library prep. VolTRAX from ONT automates the library prep process.

So happy to see this here. While sequencing is quite old, mass adoption still has not come. The benefits are clear - faster infectious disease diagnosis, personalized treatment, tracking the spread of infection, identifying food contamination - the use-cases are endless. However before nanopore sequencing came, it was always out of reach of the masses.

We've actually started BugSeq[0] to help labs get into nanopore sequencing - improving these open source tools and also writing our own. Orgs like FDA, USDA, big food co's, CDC, etc are now all adopting nanopore sequencing. Happy to see the industry taking off, this will be a step function improvement for public health in general.

(disclaimer: founder of BugSeq) 0: https://bugseq.com

personalized treatment is still best handled by gene panels. nobody has made a compelling argument for WGS for personalized med. Right now it's a huge waste of investment until we understand the multigenicity of diseases better (which is a research problem best solved by sequencing millions of individuals and using high quality WGS sequencers).

We work within the infectious disease space, so I'll give an example from our work that is still personalized medicine: Faster detection of antimicrobial resistance. Every infection will be resistant to different antibacterials/antivirals/antifungals/antiparasitics. What if we could get the patient on the right antimicrobial for their specific infection faster? There's strong evidence that timely administration of correct antimicrobials in septic shock results in improved mortality.

Nanopore sequencing very much has the potential to deliver this personalized treatment, without looking at any human genes or panels. If we could rapidly sequence bacteria in the bloodstream and predict their antimicrobial susceptibilities, we can make a difference.

What you're describing is a very reasonable research topic with some supporting evidence.

What I'm saying is that nobody has delivered on any of the huge claims about the genome which genomicists made for the last 20 years, specifically in terms of actionable human health.

it's time to start calling the bluff.

I'm not exactly sure how you can say that.

The following have been revolutionized by the human genome project and subsequent technological innovation in sequencing:

-Non-invasive prenatal diagnostics

-Screening for cancer with cell-free DNA

-Rapid and accurate diagnostics for children with suspected genetic disorders

-Targeted cancer therapeutics

Many of these are already in routine clinical use in high income countries and result in significant improvement in human health.

The impact is minor and most of the progress did NOT come from HGP data.

I worked in genomics for 20 years. I have deep knowledge of biology and medicine. And the reality is, for the amount of money invested, the actionable medical returns have been relatively tiny and industry continues to not invest in sequencers for a good reason.

> the actionable medical returns have been relatively tiny and industry continues to not invest in sequencers for a good reason

I agree with this, but I disagree with the following:

> most of the progress did NOT come from HGP data.

Without HGP (Human Genome Project), many biological discoveries in the past two decades would have become much more difficult.

> it's a huge waste of investment until we understand the multigenicity of diseases better

If you don't invest, you will never approach a solution. Applied science goes nowhere without a solid foundation in basic science.

None of the techniques you describe are reliant on WGS. I wholeheartedly agree that sequencing has revolutionized medicine, but WGS isn't there yet.

NIPT uses low-coverage sequencing to identify aneuploidies for chromosomes 13,18,21 and some larger microdeletion syndromes - this is not WGS.

Cell free cancer screening is panel based and assays specific, known driver mutations.

Rare disease diagnostics can be WGS based (and some of the rapid 48h WGS studies of NICU babies are compelling from a technical standpoint) but most diagnoses identified via WGS can also be found via WES + chromosomal microarray.

Targeted cancer therapeutic target identification is panel based for most patients, as WGS doesn't identify too many targets for FDA-approved therapies that a panel + IHC + FISH + fusion testing won't.

>What I'm saying is that nobody has delivered on any of the huge claims about the genome which genomicists made for the last 20 years, specifically in terms of actionable human health.

I mean. Sure, sequencing the human genome didn't solve our problem overnight, and you can't sequence a genome at a vending machine for a nickel to tell your future, but I think there has been an avalanche of medical data derived from the genome and that is only going to continue to get bigger.

Now that we are really starting to figure out the polygenic risks and the single deleterious variants and their links with phenotype, people will have a much better picture of what their future might hold (and how to prevent it).

I don't think it was ever a bluff. The problem just turned out harder than we thought it was going to be.

it didn't turn out to be harder than I thought it was going to be. I came into this in the 90s fully prepared for the idea of polygenic risk. In my opinion, most people who did molecular biology first think that way, while most people who learned mendelian genetics don't.

I had my genome sequenced a few years ago by Illumina. They had a big slick presentation, blah blah blah, ApoE1, etc. When the genetic counsellors came to my genome they said "huh. you don't have any risk factors". I checked and each of their risks was from an existing gene panel, so the WGS wasn't valuable (it's on PGP, if you want to work with it https://my.pgp-hms.org/profile/hu80855C).

I talked in more detail with the counsellors. Turns out, whenever they saw a novel variant that wasn't covered by a gene panel they were googling the variant and skimming the abstracts of papers.

It was at that point I realized the difference between research, PR, and actionable medical data.

>it didn't turn out to be harder than I thought it was going to be.


I've done mine as well. Most of the "company" sites don't tell you much, which I think is a legal thing. They aren't cleared to release clinical predictions from genotypes, so they just... don't. I ended up running mine through Promethease (which mines SNPedia) and found quite a bit more than what was reported.

I work with some certified clinical geneticists and yeah they do take a much closer look, but at the end of the day it's all just sequencing and interpretation. I think it's mostly just safeguards to keep bad actors at bay.

PGP looks interesting. I see that you submitted phenotype data. I didn't know they had a questionnaire with that. That's actually really interesting. I need to see what kind of questions they ask.

This was a talking point like 10 years ago which isn’t remotely true today

Sounds like the mission of Day Zero Diagnostics (dayzerodiagnostics.com). Are you working with them?

“nobody has made a compelling argument for WGS for personalized med” is a huge overreach. It’s done routinely now at most academic cancer centers and often is useful for guiding treatment decisions. There are several multi billion dollar companies that do this already.

One example: Homologous Recombination Deficiency, the signature it leaves genome-wide and the associated sensitivity to PARP inhibitors.

But agreed, it is about time we start to understand regulatory regions better. But that will require gathering more WGS data, and indeed most data is Whole Exome or Panel.

Research project, not actionable human health. I fully support large-scale WGS projects and hope that some day one of them will have a recognizable impact.

I don't know about this specific example, but DNA sequencing is already routinely used for personalized oncology therapeutics outside of clinical trials, so not really research project.

Source: Am MD and practice laboratory medicine.

Sure. Doctors love to try new technologies. most of the reports of success are happy narratives, not evidence based medicine.

There are hundreds of clinical trials and it’s used today in clinical practice, most commonly in oncology but more and more in other fields. Very interesting work in polygenic risk prediction models in many kinds of chronic disease, where risk models can refine treatment strategies. It’s very real; one of the big problems so far has been reimbursement and commercialization.

The other thing you have to realize is that because of the regulatory burden, it takes a while for these tools to make it into practice. Many of the successful genetic tests today were approved 20 years ago. Look up Oncotype Dx which is used in a huge % of breast cancer surgery, for example. WGS and WES will undoubtedly be far superior but it takes a while to get these things into practice.

Except when it makes it into guidelines written by groups of experts aggregating evidence. I can't copy due to copyright but hope you can find the content past paywall or peruse the citations: https://www.uptodate.com/contents/next-generation-dna-sequen...

The development in the article is basically a gene panel using nanopore sequencers. It programmatically ejects any sequence that doesn't match what the user is interested in, while the run is in progress.

I think typing your HLA class I and II genes is the single most valuable thing you can get now from your genome. It's also pretty likely to remain extraordinarily valuable even if whole-genome sequencing prices drop to nearly zero.

HLA associations with autoimmune disorders are extraordinarily strong. Same applies to infectious diseases, vaccine efficiency and checkpoint inhibitor efficiency.

While you can type HLA with classical techniques, the only really reliable way is to use long reads.

Same applies to CYP enzyme superfamily, where variation is linked to some rare drug toxicity events for example.

We should all know our HLA and our CYP genotypes. Why 23andme does not even attempt to impute HLA is beyond my understanding.

Totally agree! I would suggest adding KIR as well. Curious what your background/interest is?

I have consulted to National Marrow Donor Program/Be The Match [0] off and on for several years. There are typing labs using long reads but most reporting/matching/analysis is still performed at the nomenclature level [1].

I hope in the near future we'll be able to simply assemble the entire MHC for each sample, as messy as it might be, see e.g., "A diploid assembly-based benchmark for variants in the major histocompatibility complex" [2].

[0] https://bethematch.org [1] https://www.ebi.ac.uk/ipd/imgt/hla [2] https://www.nature.com/articles/s41467-020-18564-9

Sure! KIR and HLA-C are also really important.

But we know fewer associations for them. Same applies to TCR genes. A chicken-and-egg problem, we need good massive GWAS to find out.

My background is in CS, AI and statistics. But I've done lots of graduate research in genetics and epigenetics. I'm very interested in understanding the interactions between HLA and commensal / pathogen epitopes in health & disease. Also in vaccine design.

How about you? I can see from your posts you are with the Big Data Genomics team at UC Berkeley AMPLab.

All great until this is used to control people. DNA gets collected that you have no control over, and it can even be used to trace your race or relatives.

We have the internet. Great. But look at the dark side. DNA is great for things like targeted medicine, but there are totalitarian regimes that might use it.

We need some sort of awareness. Let us discuss how to deal with both sides, once you know there is a very dark side to it.

Thanks for your concern. All technologies come with benefits and risks. Of course, DNA sequencing can be used for harmful purposes, eg. tracking individuals. We should be very cautious of these risks as the technology develops, and take well thought out steps to mitigate them. A similar analogy can be made to the internet and tracking people. Overall, however, the benefits of DNA sequencing to society already far outweigh these risks.

Anyone with hands on experience using NanoPore? I've been thinking about buying one of these to play around with. But anecdotally I've heard that they lack utility or are my concerns just myths? a) they are designed to handle many batched samples at once rather than many runs of few samples over time. So in practice they don't really last for many individual samples. b) the computational requirements are high. So while a NanoPore can be plugged into a laptop in the field it would take forever to run the data processing on said computer.

Here's my write-up of buying one for fun: https://abarry.org/dna-sequencing-in-our-extra-bedroom/

> never arrived. Finally, I emailed them and found out I needed to join their “community.”

> Then Alice starts asking questions. I was not prepared for questions

> Then Alice started asking questions about my research

Wow, that is kinda creepy. You're seeking to pay them money for a product they sell, and they want to ask all these intrusive questions about what you do? Is there a reason for this? I mean imagine if DigiKey did this, what a nightmare that would be for electronics design.

Are they worried about you using it as part of a lab producing bioweapons? But frankly anybody with a shot at pulling that off would already know enough to lie very convincingly. I can't imagine this interview would be a hurdle to those folks.

The only other place I've come across this behavior is when seeking to buy wafers from semiconductor foundries. In that situation the foundries see themselves not as vendors to the chip designers, but rather as investors in the chip designers -- investing not money but rather production capacity, and earning not dividends but rather wafer purchases.

It wasn't creepy. It seemed like they were worried about folks who thought it was more like 23andMe "spit in a tube" than an actual lab instrument.

Oh, okay, but then why this:

> I repeat something about bacterial colonies from my card but she isn’t buying it. I manage to get out that I understand this isn’t a spit-in-the-tube-and-done thing and that’s all I’ve got. She keeps pushing

I mean you told her pretty clearly that you didn't think it was like 23andMe.

Why would she keep pushing?

From your post the thing uploads the sequenced data and their service generates the report. Is the raw data available?

Also: truly remarkable phd thesis!

Yes the raw data is all there. The analysis just generates more interesting pictures!

Nice write up! How much did it cost to get an Oxford Nanopore?

$1k. Prices are on the website.

Computational requirements are quite high, but OK if you have good GPUs on hand. A coronavirus sequenced sample on the fast mode without GPUs would take 3-4 hours to complete, while on the high accuracy mode days. GPU access would speed up performance considerably.

Error rate for MinIONs is still quite high (10-15%), so a human genome sequencing would be quite inaccurate in some regions.

Sequencer is quite cheap, reagents and flow cells are a little bit more expensive.

Thank you. The upfront cost of the sequencer sure makes it tempting at first sight.

My desired hobbyist use case is to key out plants, lichens and mushrooms that I find in the field. I have the bioinformatics knowhow, I just need the hardware. 3-4h seems like a long time for a genome that is <30k nucleotides long. Mushrooms on average seem to have almost as many genes as a coronavirus has nucleotides. I guess partial sequences (and thus reduced comp time?) might do the trick but it's probably hard to target those partial reference sequences with a long-read method like NanoPore.

If you repeat the process many times will it reduce that error rate, or are the errors non-independent?

Unfortunately, with nanopore the errors are biased so you tend to get errors in the same places. All sequencing techniques also have error rates but some are unbiased so running a single sample through (which will usually have many, many copies of any sequence) will average out to a good read of the sequence.

Some good info on next-gen sequencing techniques: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3841808/

Still, some of the errors can be compensated for with more coverage. So if you can manage 20-30X you're left with the homopolymer problem (nanopores can't tell how long a stretch of the same repeated nucleotide is, because you can't control how long the sensed kmer stays in the pore), but lots of other types can be improved quite a lot.
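A small Python sketch of why homopolymers are a special case: if the only thing a read gets wrong is how long a run of the same base is, then collapsing every run to a single base makes the reads agree again. Comparing homopolymer-compressed sequences is one way tools sidestep the problem; the function names and example sequences here are made up for illustration.

```python
from itertools import groupby


def run_lengths(seq):
    """Run-length encode a sequence as (base, run length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


def homopolymer_compress(seq):
    """Collapse every run of identical bases down to a single base."""
    return "".join(base for base, _ in groupby(seq))


reference = "ACGTTTTTGCA"   # contains a run of five T
read = "ACGTTTTGCA"         # same molecule, but the run was read as four T

same_after_compression = (
    homopolymer_compress(reference) == homopolymer_compress(read)
)
```

The trade-off is that the run length itself is thrown away, so this helps with matching reads to each other, not with recovering how long the run really was.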

Last time I looked into Nanopore the cost wasn't that much better where you'd even consider this experiment.

On the other hand, when doing a genome assembly, the Nanopore reads are good for a draft sequence and then the Illumina reads can be used to polish the sequence.

Is the error rate per base pair?

A close friend of mine has worked there for many years. We've spoken a lot about the tech.

I don't know the answers to all your questions, but I do know that the emphasis is on research, not consumer (or hobbyist) use. I believe the devices are ~free, but each run requires using a consumable part that has to be either disposed of or returned for refurbishment, and I believe these are hundreds of dollars each.

The big advancement is the size and cost of the devices, the fact that a lab can have one on every desk rather than a communal machine that you have to queue your samples up for, or a device you can transport in field kit.

They do have cloud services that do much of the processing for you, but I suspect you'd want to be able to manipulate the data so you'd need your own data processing tools locally. It's not going to give you a 23andMe style report, it's more likely to say "yep, that's a human" vs "you're ecoli". I believe they do have training for how to do this data analysis, but I suspect this is targeted at customers on large contracts.

Thank you for the practical insight. I suspected that NanoPores are not just yet geared towards hobbyists. I happen to have some bioinformatics knowhow so it's mainly a matter of hardware for me. As both you, u/bioinformatics and u/searine mention it is the overhead cost of flow cells etc that worries me from a hobbyist point of view.

Yep the flow cells are expensive, but more generally saying the hardware costs $1000 is probably more of a technicality – almost all sales are bigger contracts, with training, support, flow cells, hardware, included.

Also that's great you have some bioinformatics experience, but are you sure it will apply here? I'm very unknowledgeable about this so forgive me if I'm wrong, but I believe that traditional machines do short-reads, whereas Nanopore does long-reads, which I believe invalidates many techniques for reconstructing the data, the tools, etc. This might not be something you have to be concerned about depending on the level you're working at, I don't know (maybe this is all "solved" by the time you're doing analysis?).

I think theoretical/practical background will allow me to figure stuff out as I go with relative ease. Often times in bioinformatics there are useful vignettes a Google search away. But you are right that specialized tools and libraries which I haven't used before are likely needed. At least for me the main bottleneck will be proper space for processing and the aspect of (reusable) hardware to keep costs down. While NanoPores appear to have hidden costs it still seems to be way cheaper than other methods if I want to have my own sequencing rig.

They are a fun tool, great for doing molecular work in the field. The error rate is still very high compared to short reads, but if you know this and plan for it going in you should be fine.

Flowcells last for one sample. The machine should last indefinitely. You can sometimes add more of the same DNA to a flowcell after one use to get a bit more out of it, but the quality degrades quickly. 500-1000 dollars each for flowcells, depending on how much you order.

In my experience with field use, I was using Oxford Nanopore's software, which does the processing remotely, and was able to run the platform on just a regular 2015-era laptop.

Is the price a question of scale? If this technology would become commonplace, would the price go down? Are there patents that would prevent cheaper chemical production?

I assume scale and more R&D on how to produce nanopores more cheaply would be the main ways to drive price down. As for patents, Oxford Nanopore has a pretty big portfolio for all things nanopore, so a direct competitor based on nanopores that would drive the price down seems unlikely (though they obviously have to compete on price with other sequencing methods to some degree).

What is a flow cell made out of and why is the cost so high?

It's made of plastic, glass, and the special protein pores which split the strand and read the DNA. Reagents and sample are applied to it to make the reaction happen.

The flowcell gets contaminated with your sample after one run so they are 'one time use'. The nanopore protein eventually stops working also.

They are expensive because doing molecular biology is expensive. It requires expensive machines and expensive reagents at atomic scales to create. Thus money is required.

Actually, one of the main features of this tech apart from the obvious size-factor is that it's a streaming process. You can analyse data on the fly and decide when to stop the run. Wash the flowcell, and use it for another sample. Eventually the pores die, yes, how fast depends on the sample type. I think they guarantee 48 hours or something of the sort.

The expensive part is not the chemistry. Each flowcell has a very expensive piece of metal that senses the very small current variations that each kmer causes when going through each pore. They've actually come up with a device (horribly named "flongle") that has the same shape of a flowcell but no pores, and the mini flowcell it uses is ~90USD (against ~900USD for a full flowcell). Of course, yield is much lower.

I used to run a department at a biotech where ~50% of our data came from MinIONs (although, that said, I'm a bioinformatician, rather than a molecular biologist), so I can answer your questions. For (a.), you can for sure "batch" samples. The term of art you're looking for is "multiplexing". Nanopore provide prep kits that allow you to "barcode" different samples (i.e. tag all the molecules in a given sample with a unique, synthetic sequence, which allows them to be distinguished by software downstream), but note that (as with all DNA prep kits, but some more than others) you'll need access to a fair whack of lab equipment and consumables to use it (these kits aren't "all-in"). For (b.), for one anecdata point, I used to process a whole flow cell's data on an M4800 with a 4th Gen i7 and 32 GB of RAM in a few hours. Most of the "high" computational requirements you hear about relate to either assembly or variant calling (both of which are downstream of just retrieving "usable" sequencing data); and even both of those I've managed on that same laptop overnight. Actually acquiring the data (you can delay base calling if you like, although you probably wouldn't need to) is real-time and only needs very modest hardware (IMHO the Nanopore "system requirements" are very much on the "safe-side".) "In the field", your challenge would be physically preparing the samples!
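For anyone wondering what the "distinguished by software downstream" part of multiplexing looks like, here is a toy Python sketch. It assumes the barcode sits at the very start of each read and allows a fixed number of mismatches. The barcode sequences, sample names and function names are all hypothetical, and real demultiplexers use alignment rather than a fixed-position comparison because nanopore reads contain insertions and deletions.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, barcodes, max_mismatches=1):
    """Bin reads by the barcode at their start and trim the barcode off."""
    bins = {name: [] for name in barcodes}
    bins["unclassified"] = []
    for read in reads:
        best_name, best_dist = None, max_mismatches + 1
        for name, barcode in barcodes.items():
            if len(read) < len(barcode):
                continue  # read too short to carry this barcode
            dist = hamming(read[: len(barcode)], barcode)
            if dist < best_dist:
                best_name, best_dist = name, dist
        if best_name is None:
            bins["unclassified"].append(read)
        else:
            bins[best_name].append(read[len(barcodes[best_name]):])
    return bins


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"sample_A": "AAGGTT", "sample_B": "CCTTGA"}
reads = ["AAGGTTACGTACGT", "CCTTGAGGGG", "AAGCTTTTTT", "GGGGGGGGGG"]
bins = demultiplex(reads, barcodes)
```

The third read has one error inside its barcode and still lands in sample_A; the last read matches nothing well enough and is left unclassified.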

My first reaction after reaching the halfway point in the article was to check it was not April 1st already.

But even on a site like Stackoverflow (hey I can trust Joel right?), and even after coming here and reading "hey yes we build / use those too" I am struggling to believe this.

What else don't I know about in biotech? How far ahead is the industry compared to where the average man on the Clapham omnibus thinks it is?

Please stop the world I want to get off.

Nanopores have unacceptably high error rates. Around 10%

Is this an accuracy or precision issue? I am imagining that if you actually have access to the device, you could do as many runs as you want, getting to arbitrarily low error rates.

This is a common misconception - "averaging out" errors only works if the errors are pretty rare at any given site. This is true for some types of errors & sequencing technologies, but not universally true. Some types of DNA sequences (most notably homopolymers and other simple repeats) are very difficult to sequence correctly, and X% of the reads there will be incorrect. If X > 20% or so, then it may look like real germline variation no matter how many reads are sequenced.
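A toy Python illustration of the difference, using a per-column majority vote over reads that are assumed to be already aligned and of equal length. The sequences are invented for the example, and the errors are shown as substitutions for simplicity even though real nanopore errors are often insertions and deletions.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority vote per column over equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_reads)
    )


truth = "ACGTTTTTGCA"

# Scattered errors: four of the five reads are wrong, each at a different site.
scattered = [
    "ACGTTTTTGCA",
    "ACCTTTTTGCA",
    "ACGTTTTTGGA",
    "TCGTTTTTGCA",
    "ACGTTTATGCA",
]

# Biased errors: three of the five reads make the same mistake at the same site.
biased = [
    "ACGTTTTAGCA",
    "ACGTTTTAGCA",
    "ACGTTTTAGCA",
    "ACGTTTTTGCA",
    "ACGTTTTTGCA",
]

recovers_truth = consensus(scattered) == truth
fooled_by_bias = consensus(biased) != truth
```

With scattered errors the vote recovers the true sequence even though most reads contain a mistake. With biased errors the vote confidently reports the wrong base, and adding more reads with the same bias does not change that.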

The errors are non-random. That's why they use machine learning to figure out those errors. You could, of course, also just do traditional statistics on sequences that you want to sequence all the time. I've done that with plasmids before, and it works pretty good. I think there are a few papers on it too.

> The errors are non-random.

Could you elaborate / give an example? Are the errors deterministic? Is it like ISI (Inter-Symbol Interference[1]) in signal processing, where some symbols interfere with the reception of the next symbol(s)? Are there short range errors (one letter) or long continuous errors?

[1] https://en.wikipedia.org/wiki/Intersymbol_interference

It's a complicated issue; I tend to think of the error component of any one MinION observation as being a function of the k-mer in the pore at the time (i.e. the subject of the observation) and, with some decaying dependence, the sequences (i.e. in both directions) that extend out from either side of the target k-mer. You might say that MinION error is a function of the target k-mer and its immediate environment. It gets even messier when you try to imagine the form of that function; for one, it's not _completely_ good enough to remain in sequence space alone: among other things, the "shape" (i.e. the conformation) of that (DNA or RNA) molecule around the target k-mer will influence how the shape of the pore will change in response to the target k-mer, which, in turn, will influence the observed current signal (i.e. manifest as a deviation from the "expected" or "ideal" current signal for that k-mer!). As I understand it, Nanopore don't spend too much time actually modelling k-mer-in-pore dwell-mechanics; instead their best base callers use machine learning to generalise across the swathes of available sequencing data for known targets (and give really quite impressive results, all things considered).


There is a real example I ran a few months ago. How to read it is here https://en.m.wikipedia.org/wiki/Pileup_format

Positions like 172 have errors more often than not because the basecaller is wrong sometimes (note: this is from a sequence verified sample).
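
For anyone who would rather count than eyeball, here is a rough sketch of tallying the read-bases column (the fifth field) of one pileup line, following the format described on the Wikipedia page above. The function and key names are made up for illustration, and it ignores less common symbols such as `>` and `<`.

```python
import re

def summarize_pileup_bases(read_bases):
    """Tally one pileup read-bases string.

    '.' and ',' are matches, letters are mismatches, '*' is a base deleted in
    that read, '+nSEQ' / '-nSEQ' announce an insertion / deletion starting
    after this position, '^x' marks a read start and '$' a read end."""
    counts = {"match": 0, "mismatch": 0, "deleted": 0,
              "ins_after": 0, "del_after": 0}
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":                # read start + one mapping-quality character
            i += 2
        elif ch == "$":              # read end
            i += 1
        elif ch in "+-":             # indel: sign, length digits, then the bases
            m = re.match(r"\d+", read_bases[i + 1:])
            if m is None:
                raise ValueError("indel marker without a length")
            counts["ins_after" if ch == "+" else "del_after"] += 1
            i += 1 + len(m.group()) + int(m.group())
        else:
            if ch in ".,":
                counts["match"] += 1
            elif ch == "*":
                counts["deleted"] += 1
            elif ch.upper() in "ACGTN":
                counts["mismatch"] += 1
            i += 1
    return counts

print(summarize_pileup_bases("..,A^].$,-2AG.*+1Ca"))
```

Dividing the mismatch and deleted counts by the depth at each position gives a quick per-position error profile for a sequence-verified sample.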

The errors come up more often in some sequences than they do in others. I’m not really sure about symbol processing, but if you have any beginner resources for that I’d appreciate them!

Don't know why this was downvoted. If I'm not mistaken, there is generally a high error rate per pore, fundamentally because it's a single-molecule experiment. These errors get averaged out, but the reads may be difficult to align, as it might not necessarily be a straightforward averaging. There are also segments that are fundamentally difficult to sequence correctly (single-nucleotide runs, not even a super high n) that will probably never get satisfyingly resolved no matter how many times you sequence.

Are you sure about that? My last consensus run worked with complete coverage of ~410 bp region. Here is a gist of the raw pileup without consensus - https://gist.github.com/Koeng101/abc674e1acd575646748afcbcc7...

Visually, I think, you can see that it isn't THAT bad (low coverage at the ends is because of how I barcoded the sequences).

I hate to be that guy, but have you actually used the technology? And if so, approximately what year? Unacceptable for what procedure? Do you have any raw reads that have been troubling you?

They mean at genome-wide scales. If you are just doing a 410 bp region, the sequence is short enough that the signal is going to crush any noise you get from strands slipping in the pores.

The errors nanopores get are gaps, not base pair substitutions. So with things like viral or bacterial sequencing you don't really have huge issues.

When you are doing large eukaryotic sequences with lower coverage on average, you start picking up a lot of deletion artifacts. Which isn't a huge deal if you have a very well annotated genome like human, but if you are doing pioneer genomics it can create some difficulties. Often if the genome isn't well annotated, it's best to pair nanopore with short reads.

The gaps are usually homopolymers and such, which should get helped by R10 pores. But true, at low coverage, things can get tougher!

That all depends on what you want to do with the data. For assembling new genomes, they produce very long reads that are essential for "scaffolding". They're also great for structural variant detection (large rearrangements of DNA). DNA sequencing is not a monolith and there's room for lots of different complementary technologies.

It should be noted that the "errors" in this case are gaps in sequence. Sometimes the DNA strand slips through the pore and some bases aren't called.

The actual base calling is on par with Hi-seq in my experience. In software terms, you are missing chunks of code, but aren't flipping bits.

This is important because in certain experiments, you care less about those gaps (scaffolding for example). So you can get a lot of cheap utility out of nanopore sequencing.

This is a common, and often justified, though not always fair, criticism. MinIONs have an error rate of around 10% for _any given base_. Moreover, these errors aren't entirely independent of one another, so if you struggle to sequence a given base the first time, you're likely also to struggle if you try again. That said, if your experiment is such that you're only sequencing a guaranteed single target (e.g. one, isolated coronavirus genome), in that one sequencing run (on that one flow cell), you'll "re-sequence" any given region many times and, unless you're looking at "problematic" (i.e. low-complexity) regions, you _will_ be able to "average out" the errors to reveal the true target sequence. On the other hand, if you're trying to co-sequence a mixture of closely-related targets, that's when the headache starts...
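
As a toy illustration of "averaging out", here is a column-wise majority vote over reads that are assumed to be already aligned to the same length, with `-` as the gap character. The function name and the reads are invented for the example. Real consensus tools do considerably more than this.

```python
from collections import Counter

def majority_consensus(aligned_reads):
    """Column-wise majority vote over equal-length, pre-aligned reads.
    '-' marks a gap in a read; ties are resolved arbitrarily."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

# Scattered errors: each read is wrong somewhere different, so the vote recovers the target.
scattered = ["ACGTTAGC", "ACGATAGC", "ACGTTAGC", "AC-TTAGC", "ACGTTAGC"]
print(majority_consensus(scattered))

# Correlated errors: most reads drop the same base, so the vote keeps the error.
correlated = ["GGGGG-", "GGGGG-", "GGGGGG"]
print(majority_consensus(correlated))
```

The second call is the "problematic region" case: when most reads make the same mistake at the same place, voting preserves the mistake.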

The advancement in DNA sequencing tech for humans has been a boon for fighting extinction of other animals too. Sequencing bird DNA from feathers to determine their migration and check population was envisioned decades ago and has only recently been made possible by the advancement of the tech.

The Bird Genoscape Project[1] was also showcased in this excellent Nat Geo video[2].



seems pretty impressive. Here is the code linked in the article that does the signal processing to decode the sensor data into DNA sequences. https://github.com/skovaka/UNCALLED

How does it handle repeats? I can understand reading AACCCT... since they say the signal depends on several letters. But what about 12 Gs? Or longer runs of the same letter? Is there some way to clock one nucleotide at a time?

As others have said, you're reading a sliding window of k-mers over the target sequence; I think for the MinION k is presently 5. To answer your question directly, it struggles with homopolymer runs, not inherently because they're low complexity, but actually because it's tricky to "clock" how many like, contiguous k-mers have passed through the pore after a given period of time. That is to say, for example, if your target sequence is "GGGGGGG" (i.e. a homopolymer run of 7 Gs), you'd expect to observe three like, contiguous signals (i.e. in current space) for the all-G 5-mer, one signal each per "clock cycle" (which corresponds to the dwell time of the k-mer in the pore). If these "clock cycles" were always constant, it's merely a case of dividing the "time spent on the observed all-G 5-mer" signal by the "time spent on one clock cycle". Sadly, for our purposes, there's enough wobble in any one such "clock cycle" that that calculation won't always yield a reliable result. The upshot: your "GGGGGGG" (7 Gs) target sequence may be registered as "GGGGGG" (6 Gs) or "GGGGGGGG" (8 Gs), or even something else. Now, for distinguishing two alleles where the difference between them is, say, a doubling in length of an already-very-long homopolymer run, even with the aforementioned "clock wobble", you'd likely be able to see that in MinION data quite clearly. As with all things DNA sequencing (for the time being, at least!), your precise biological question will determine which (one or more) sequencing techniques are best for the job!
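
The clock arithmetic above can be sketched as a toy model. Everything here is an assumption made for illustration: the function names, the uniform jitter on each dwell time, and k=5. It is not how any real basecaller works.

```python
import random

def kmers_in_run(run_length, k=5):
    """How many identical all-one-base k-mers pass the pore for a
    homopolymer run of `run_length` bases (sliding window of width k)."""
    return max(run_length - k + 1, 0)

def estimate_run_length(total_dwell, mean_dwell, k=5):
    """Idealized clock: time spent on the repeated k-mer signal divided by
    the mean time per step, converted from steps back to bases."""
    steps = max(round(total_dwell / mean_dwell), 1)
    return steps + k - 1

def simulate_calls(run_length, trials=1000, k=5, mean_dwell=1.0, jitter=0.5, seed=0):
    """Draw a jittered dwell time for every step and record the run length
    the idealized clock would report. Returns {called_length: count}."""
    rng = random.Random(seed)
    calls = {}
    for _ in range(trials):
        total = sum(mean_dwell * rng.uniform(1 - jitter, 1 + jitter)
                    for _ in range(kmers_in_run(run_length, k)))
        called = estimate_run_length(total, mean_dwell, k)
        calls[called] = calls.get(called, 0) + 1
    return calls

print(kmers_in_run(7))                        # all-G 5-mers in a run of 7 Gs
print(sorted(simulate_calls(7).items()))      # how often each length is called
```

With the jitter set to zero, every trial reports 7. With wobble, some trials report 6 or 8 instead, which is the miscounting described above.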

Just a thought. If the DNA were run through 2 such holes, you could use a nearby non-uniform sequence to clock the reading of the other one. Not a magic bullet, but maybe an improvement. Assumes the readers can be close enough to bound the amount of slack between them, and that they don't interfere with each other.

Good thinking! The newer R10 pores have a dual read head, essentially two holes in sequence inside the same barrel. The linked page [1] has an image.


A single flow cell contains a few thousand pores (I think this is what you mean by "holes") that are all at different stages of passing different molecules, with signal data being captured from a few hundred at any given time. In practice you'd never expect (nor could you arrange) for two pores to be at the same stage of processing the same (or any pre-determined) molecule at the same time, so correlation information like that is out. The "clock rate" is determined by the so-called motor protein that "pulls" the nucleic acid molecule through the pore, if you fancy going down the reading rabbit-hole...

No, I meant a single pore with two readers. So the same molecule is being read at 2 positions. Movement might be detectable in one, but not the other because it's full of repeats.

Nope. You're working with kmers. I think it's 6mers in the current models. It's good because you get redundancy as you move, but coupled with the fact that you can't control dwelling time it makes repetition hard to handle.

Really cool to see this here. I worked on solid-state nanopore development as a part of my PhD.

Here are some press releases related to articles I published during my PhD:



> If you try to commercialize it, that takes a while to start a company, and it can take so long that by the time you go to the mechanics of that, the next thing has already emerged.

Technological singularity is here! :)

[1] https://en.wikipedia.org/wiki/Technological_singularity

Wow I never thought of this. I understand all the controversy over 23&me and DNA secrecy, but it seems pretty soon it'll be trivial to run DNA anywhere anytime.

I'm wondering about the impacts of cheap/accessible DNA sequencing in the future. Not just impacts to existing businesses, but what does it mean from a privacy perspective? If someone could take a strand of your hair and then get your genome sequence from it - what would be the implications?

In the long future: total loss of privacy and identity as meaningful concepts.

I was wondering if this would be a solution to the privacy issue that surrounds 23andMe. I would gladly pay a significant amount above $99 (or whatever 23andMe charges) for an offline version of DNA insights.

Also, this is not new. It's been around for yrs

Maybe, but it was a good summary for me and I've been in biophysics for 3 years or so. Also, lots of good keywords and discussion generated here to follow up on. Overall, very useful article and discussion.

schatz periodically dumps PR for attention

Maybe if you ran the test 100 times and did some pileups by position it would be usable in comparison to WGS

If you want to see what a real run looks like, here is a little gist of my last Nanopore run, raw basecall -> alignment (no consensus)


Could a DNA sequence be used as a private key / seed for a Bitcoin wallet? Does it make sense?

At the 2014 DEFCON Biohacking village I did exactly that. I gave out like 50 tubes of plasmid, all you had to do is go sequence em to extract the private key, and boom, you get like $200 (or like 15K today...)

Literally nobody did it for a couple years, so I ended up taking out the bitcoin to pay for more DNA synthesis a few years ago. I actually did delete the bitcoin private key though, so I had to pay for sequencing it back out...

what was your encoding scheme? hash of some character representation was the key?

2 base pairs per byte mapping. Super simple.
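
The comment doesn't spell out the exact mapping, so the following is only a guess at the general idea and not the scheme used at DEFCON. It is a sketch of one common convention, two bits per base (so four bases per byte), with an arbitrary assumed assignment of A=00, C=01, G=10, T=11.

```python
BASES = "ACGT"  # assumed assignment: A=00, C=01, G=10, T=11

def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)

def dna_to_bytes(seq: str) -> bytes:
    """Inverse of bytes_to_dna; expects only A/C/G/T and a length divisible by 4."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)  # ValueError on non-ACGT
        out.append(byte)
    return bytes(out)

key = bytes(range(32))             # stand-in for a 32-byte private key
encoded = bytes_to_dna(key)
print(len(encoded), encoded[:16])
assert dna_to_bytes(encoded) == key
```

A naive mapping like this can produce long homopolymer runs (for example, a zero byte becomes AAAA), which is exactly the kind of sequence discussed elsewhere in this thread as hard to read back.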

Any password-like object has to be changeable. And easily.

Same problem as all biometrics. Data about you makes for a bad password. It can make an ok username tho.

Memorising the seed words of one key + a backup key in a 1-of-2 multisig setup seems to be a good alternative.

How do we get a dongle?

Incredible, thanks! Cheaper than a shotgun sequencer :)

Bear in mind that it'll only work once. Additional consumable flow cells are a similar price.

I think it's misleading for the article to call this a dongle. It's a $1k USB device with expensive consumables, more like a printer.
