Is there any way to just have your entire genome sequenced and get all the data in a software-friendly format? At that point there could/should be some open source software for analyzing it and finding common or well understood things like this. That way the software could be updated and people could re-run their analysis to look for newly discovered stuff.
I think this would be an awesome amount of fun. I for one would be interested in looking for certain gene variants that are not mentioned at all over at 23andMe.
Check out this series of articles from 2016 by Carl Zimmer [1]. He gets his genome sequenced by Illumina ($3100) and joins a medical study so that they'll give him the raw data. He gets the 70GB "BAM file" (Binary Alignment/Map) and passes it around to experts who dig into it. Multiple weeks of computer time plus expert analysis: this is not a simple thing yet.
Have you already done a 23andMe analysis? If so, you can check out https://promethease.com/. It's exactly what you're looking for as they have constant updates that make it worth your while to rescan every year or so.
Promethease is awesome. I uploaded my 23andMe data to it and got back the kind of data I'd been hoping for in the first place.
Fair warning: the UI is very geeky. I think any HN reader should be able to find their way around without trouble, but I wouldn't recommend it to my non-technical friends or family.
23andMe will let you download a text file with the ACGT data, but only for the SNPs it tests. 23andMe does not sequence your full genome, so the SNPs available are a small subset of your DNA.
Promethease is not open-source (I think), but all it does is read various files of DNA data (like the 23andMe export), match them up with the information in SNPedia (a Wikipedia-like open repository of what we know about certain SNPs), and then export a pretty HTML/JS web report for you that you can download and save.
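The matching step is conceptually just a join between your genotype calls and an annotation database. Here's a minimal sketch of the idea; the rsids and annotation strings are made-up placeholders, not real SNPedia content:

```python
# Toy sketch of what a Promethease-style tool does: look up each of your
# genotype calls in an annotation table and collect the hits.
# The entries below are illustrative placeholders, not real SNP data.

ANNOTATIONS = {
    ("rs0000001", "AA"): "example annotation for genotype AA",
    ("rs0000002", "CT"): "example annotation for genotype CT",
}

def annotate(genotypes):
    """genotypes: dict of rsid -> genotype call, e.g. parsed from a 23andMe export."""
    report = []
    for rsid, call in genotypes.items():
        note = ANNOTATIONS.get((rsid, call))
        if note:
            report.append((rsid, call, note))
    return report

my_calls = {"rs0000001": "AA", "rs0000003": "GG"}
print(annotate(my_calls))  # only rs0000001 has an annotation in this toy table
```

The real tool layers report generation and frequency data on top, but the core is this kind of lookup.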
I did Illumina UYG. As part of that I got a 1TB hard drive with the nearly-raw files (BAM format with raw reads, VCF with variants).
Lots of people say "I for one would be interested in looking for certain gene variants that are not mentioned at all over at 23andMe", but they either never do anything with the data, or they look into it and realize that SNP analysis of gene variants is still a charlatan's game.
For those interested in doing this, I will second parent. I have my full genome sequenced, and learned basically nothing that was all that interesting or actionable. It is very early days for DNA analysis.
I'm a bioinformatician, but haven't really pondered doing this on my own DNA too much. Wouldn't be terribly complicated to sequence and analyse your entire genome though.
Provided you could purify your DNA, sequencing wouldn't be an issue - just send it off to someone like BGI (Beijing Genomics Institute) and download the seq files when they're done. Purified DNA is stable and inert, so no special conditions required for posting it.
Sequence files are just text (if they're in FASTQ format), and all the common tools are open-source. No doubt someone somewhere has put together a Docker image with software for the entire workflow (FASTQ file processing --> read alignment --> variant calling), so processing isn't a big issue. As there's no de novo genome assembly or anything like that, the whole thing can be done on a run-of-the-mill PC, and would take a few days, depending on the depth of sequencing.
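To give a feel for how simple the text format is, here's a minimal FASTQ reader. Real pipelines use the standard tools for this, of course; this just shows that a record is four plain-text lines:

```python
# Minimal FASTQ reader: each record is four lines --
# @id, sequence, '+' separator, quality string.

def read_fastq(lines):
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)            # the '+' separator line, ignored here
        qual = next(it).strip()
        yield header.strip().lstrip("@"), seq, qual

sample = [
    "@read1",
    "ACGTACGT",
    "+",
    "IIIIIIII",
]
for name, seq, qual in read_fastq(sample):
    print(name, len(seq))
```

A whole-genome run is just many millions of these records, which is why the files are tens of gigabytes.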
My guess is cost would be approaching US$1000 now.
Totally agree with everything you said, although I believe the price is closer to $500 than $1,000. I worked at a lab last year that did methylation analysis on rat genomes, and the price for sequencing was nowhere near $1,000. Although the analysis was slightly different since they pulled out all the non-methylated DNA, we still ended up with >50GB of 50 bp reads with decent coverage of the genome. I'm certain that whole-genome sequencing would be easier than what I described.
I couldn't find any format that was neutral between vendors, so I wrote something (dna2json) that converts these vendor specific ones into a flat JSON file you can query easily.
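The conversion itself is straightforward. Here's a sketch of the idea (not the actual dna2json code), assuming the common 23andMe-style raw export layout of '#' comment lines followed by tab-separated rsid / chromosome / position / genotype; other vendors differ slightly:

```python
import json

# Sketch of a dna2json-style converter for a 23andMe-style raw export.
# Input layout assumed: '#' comment lines, then tab-separated
# rsid, chromosome, position, genotype. The sample row is illustrative.

def to_json(raw_lines):
    records = {}
    for line in raw_lines:
        if line.startswith("#") or not line.strip():
            continue
        rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
        records[rsid] = {"chrom": chrom, "pos": int(pos), "genotype": genotype}
    return json.dumps(records, indent=2)

sample = [
    "# header comment from the vendor",
    "rs0000001\t1\t82154\tAA",
]
print(to_json(sample))
```

Once everything is keyed by rsid in one flat structure, querying across vendors is trivial.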
If you are willing to let your DNA be publicly available online, check out George Church's Harvard-affiliated Personal Genome Project: http://www.personalgenomes.org/
The goal is to have a dataset of free and open genomic data that scientists can analyze freely, avoiding commercial silos of data.
They will sequence your entire genome for free, subject to a backlog caused by funding shortages.
I think you can pay $1,000 to jump to the head of the line. You may also be able to jump to the head of the line if you meet certain "interesting" criteria, like being willing to have multiple members of your family sequenced. I haven't looked into this in a while, so you'll need to check and verify this paragraph.
OpenSNP is for analysing your own SNP data, which you can download from services like 23andme: https://opensnp.org/
However, sequencing your entire genome is generally not available commercially. If you can find it, expect to pay at least a few thousand dollars for the raw data, and that's just sequencing reads that will need a lot of work to get to anything like a genome. Your best bet might be to try to join a genome sequencing research study and pre-agree to have access to your own data.
It's way less than that. They scan SNPs, of which there are about 10 million in total, so only about 0.3% of the human genome varies between all of us. I think they only test the 602,000[1] most common SNPs, which is only about 0.02% of the genome, though they might do a few more.
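A quick back-of-the-envelope check of those percentages, assuming a roughly 3.2 billion base pair human genome:

```python
# Sanity-check the percentages above against a ~3.2 Gbp genome.
GENOME = 3_200_000_000
total_snps = 10_000_000
tested_snps = 602_000

print(f"{total_snps / GENOME:.2%}")   # all known SNPs as a share of the genome
print(f"{tested_snps / GENOME:.3%}")  # the tested subset
```

That works out to about 0.3% and about 0.02%, matching the figures above.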
A single error in the very large part of DNA that shouldn't vary between individuals, the part that "makes an ordinary human body with normal systems", means that you don't get an ordinary human body with normal systems.
Many such errors cause non-viable embryos, but if you have survived up to this point, such a difference is still quite likely to have a meaningful impact on your health, and it is precisely the part that you'd want to have scanned and verified.
For adult DNA scanning we're not really interested in all the genes that vary between people and code for the color of your eyes, the melanin content of your skin, the shape of your nose or your height - but we are very much interested in, for example, scanning the genes that encode the CFTR protein to check if you (or your kids!) will have issues with cystic fibrosis.
It's possible that you don't really have (or your kids are likely to not have) an "ordinary human body with normal systems" - that's what you'd need to find out.
>A single error in the very large part of DNA that shouldn't vary per individual
However true, that is irrelevant to genetic diagnostics as they exist today. We have no idea how a random error might impact health, aside from a very limited set of known mutations that are sufficiently frequent in the population to enable statistical correlation. We are probably decades away from being able to say, for a random mutation, 'this will lead to a deficiency in the synthesis of protein A, which impacts the development or working of organ B'. We can't even agree on the proportion of junk DNA.
This is helpful if you have rare symptoms with no currently available explanation - if you get a list of the "unusual" mutations that you have and correlate it with the same data from the few people worldwide who have the same issue, you get a chance to improve that condition.
I recall seeing cases of rare genetic disorders that have been diagnosed that way, by online communities sharing data.
http://matt.might.net/articles/my-sons-killer/ is one story that counters "this will lead to a deficiency in the synthesis of protein A which impact the development or working of organ B". For many parts of DNA we do know what protein they encode. For many proteins/enzymes/etc. we have some idea of their function in the body - and if we have a test subject missing that protein, then the symptoms will be even more indicative of this, even if the population is tiny (1 in this example!) and doesn't allow for any statistical inference.
This means that if we really want to, we can try to find out the likely effect and possible workaround of a particular mutation, even if we currently don't have a ready-made answer for it.
That's right. I'm suggesting the average diagnostic test need only concern itself with those areas of the genome that are known to contain mutations that result in pathologies.
I think you said the same thing but in a clearer way.
I think the problem is that aside from a handful, they are not known. It's like saying you're only going to copy the parts of a program where bit errors are known to cause problems.
It's not that only 2% _can_ vary, it's that each person has about 2% different from the reference genome (and that 2% is different for every single person).
That's not at all true. For rare, undiagnosed disease we have to sequence the entire genome in order to look for the causative variants. For well understood (common) genetic disease we have small panels, but to say that only a small portion of the genome is informative is not correct. Additionally, there is no way to know a priori which loci will have the variation without sequencing the entire genome.
I think 23andme's genotyping is about 1% coverage compared to whole-exome sequencing, which itself sequences the ~1% of your DNA that codes for proteins.
One option is to take raw data from a service like 23andme and use imputation[1] to generate the missing data. This isn't as accurate as properly sequencing everything, but it will get you more data to play with for free...