
An ambitious project to map all the cells in the human body - ringshall
https://www.nature.com/news/the-human-cell-atlas-from-vision-to-reality-1.22854
======
xvilka
By the way, for the previous great mapping efforts, finished or in progress,
like the human genome or the human brain connectome: are those datasets
available to download somewhere?

~~~
vamin
Not sure about the brain connectome project, but for the human genome project
(and many other genome projects) whole genomes are available for browsing and
download here: [https://genome.ucsc.edu/cgi-
bin/hgGateway](https://genome.ucsc.edu/cgi-bin/hgGateway).

------
vanderZwan
It's likely that most people here are not up to date with just how quickly the
field of biology has changed in the last decade. Last year I somehow bumbled
my way into programming for Sten Linnarsson's group[0], via a HN who's hiring
thread, no less! That group turned out to be _Kind Of A Big Deal_ in this
field (I mean, I was just impressed by how well-written Sten's code was,
especially for a professor, and liked the idea of programming for scientists).
The stuff I learned about this field is pretty mindblowing.

Think of the wall of text below as context for this article, as explained by
someone not inhibited by proper knowledge of what he's talking about.

We start with a tiny biology refresher. Every cell in your body is essentially
a clone: barring mutations, they all have the same genetic information encoded
in their DNA. We often think of DNA as nature's code. To build on this
metaphor, think of your stem cells as freshly installed PCs with all possible
software you could need, pre-loaded on a gigantic HDD: your genome. And just
like a program on your disk, genes don't do anything until they're activated.
To "run" the code in a gene, the cell makes active copies of it: RNA. In
oversimplified terms, a copy of RNA represents loading a genetic program
into RAM and running it. The rest of the cell is basically the wetware
required to run that code, which is important too: a PC is kinda useless
without IO (just look at the struggles early nerds had with simply finding _a
use_ for the Altair 8800 if you don't believe me[1]).

Anyway, going from stem cell to a specific cell-type is like setting up those
identical PCs for the different things that we use PCs for by opening
different programs. Unlike PCs our wetware is massively parallel: to get more
performance in a specific task our cell just creates more RNA copies, "loads
many copies of the same program". Because of this, we can measure the activity
of a gene by measuring the number of RNA copies of it in a cell.

Now imagine we're trying to reverse engineer an alien computer, and unlike
Independence Day[2], we're not dealing with Mac-compatible hardware here. In
programming, when you try to reverse engineer what someone else's code does, a
disassembly probably won't cut it. Similarly, knowing the genome will not be
enough: to _really_ make sense of it all, to "debug" the DNA, we need to
see that "running code" in context, in relation to the whole organism.

What molecular biologists have been doing in recent years is measure the gene
expression (the number of RNA molecules) of individual genes in individual
cells, across as many cells as possible, at various stages of cell
development. Then they compare the gene expression levels with various
algorithms to organise them. Combined with extra meta-data, like what tissue
the cells came from, what cell-type we know the cell belonged to based on
morphology (in plain English: its visible shape in a microscope), and at which
point in development the cell was harvested, we can then start painting a picture
of cell development and cell types.
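
To make "comparing gene expression levels" a bit more concrete, here is a toy
sketch in Python. Everything in it is an assumption for illustration: the gene
and cell names, the counts, and the choice of Pearson correlation as the
similarity measure. Real pipelines work with vastly more cells and genes and
use more sophisticated algorithms.

```python
from math import sqrt

# Toy count matrix: one row per cell, one column per gene.
# All names and numbers are made up for this sketch.
GENES = ["geneA", "geneB", "geneC"]
COUNTS = {
    "cell1": [90, 5, 5],
    "cell2": [180, 12, 8],
    "cell3": [4, 50, 46],
}


def normalise(counts):
    """Turn raw RNA counts into fractions of the cell's total count."""
    total = sum(counts)
    return [c / total for c in counts]


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


PROFILES = {name: normalise(counts) for name, counts in COUNTS.items()}


def most_similar(name):
    """Return the other cell whose expression profile correlates best."""
    others = (other for other in PROFILES if other != name)
    return max(others, key=lambda other: pearson(PROFILES[name], PROFILES[other]))
```

Here `cell2` has twice the total RNA of `cell1`, but after normalising the two
have nearly the same profile, so they end up grouped together while `cell3`
looks like a different kind of cell.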

They take tissue samples, separate them into individual cells, then measure the
expression of each gene in each cell. The techniques they have figured out to
do this in bulk, like droplet-based approaches[3], are mind-blowing. When given
the budget, and in the hands of appropriately skilled
molecular biologists, you can now "easily" measure the expression of all genes
in hundreds of thousands of cells.

Doing this you can figure out the different cell types, which genes are
involved in them, and at which stage of development cell types start (dis)appearing.
But remember that we separated all of the cells, so we lost the context of the
original tissue. To fix this, you can take new tissue samples, and apply
techniques like smFISH or MERFISH[4] to attach fluorescent markers to
_individual copies of RNA_ of a specific gene. Gene expression is measured
just by counting the number of fluorescent dots on a microscope slide, each
representing an individual RNA molecule. Yes, biologists actually do this
(with software help of course), and it works. You end up with a map of gene
expression in the tissue. Also the pictures look beautiful.

This is, in a nutshell, single-cell transcriptomics[5]. Or at least the part I
am exposed to through the research group I work for. It's mind-boggling and
awesome to see what has become possible in the span of just a few years, and
the field hasn't plateaued yet.

Thanks to all of this data we can discover all kinds of wonderful things that
were invisible before. A simple example: last year it was discovered that one
type of neuron in the sympathetic nervous system turned out to have _seven_
sub-types from the genetic point of view[6]. The hypothesis is that having
many sub-types makes sure they "wire up" correctly; you wouldn't want to have
goosebumps whenever your heart beats. Of course, the part of the article that
went viral was that we have a type of neuron solely responsible for goose
bumps and nipple erections...

Anyway, the speed and size at which data is being gathered is rapidly growing.
The article even mentions this explicitly. New techniques for dealing with
this growing mountain of data are developed rapidly too. For example, the
research group I work for just co-published a paper with Peter Kharchenko's
group, describing a technique to estimate in _which direction_ a cell's RNA
expression is moving[7]. So instead of just having scalar tSNE scatter plots,
we get tSNE _vector fields_ , showing us the direction in which cell
development is moving. And the coolest part is that it can be applied to
existing data sets, because it relies on various types of measurements that
were already being done anyway.

I'm just standing at the sidelines of all this, being incredibly impressed and
humbled that I get to try to contribute a little bit to all of this. I'm just
a programmer/interaction designer. Sten Linnarsson hired me to develop a user-
friendly, flexible and _fast_ web-based data browser, the Loom Viewer, for his
new Loom data format. But this post is long enough as is, so I'll reply to
myself with a comment explaining that.

Anyway, this is an exciting, ambitious project, that probably will have an
enormous impact on biology and medical research, on a much shorter time-scale
than you think.

[0] [http://linnarssonlab.org/](http://linnarssonlab.org/)

[1]
[https://en.wikipedia.org/wiki/Altair_8800](https://en.wikipedia.org/wiki/Altair_8800),
[https://www.youtube.com/watch?v=1FDigtF0dRQ](https://www.youtube.com/watch?v=1FDigtF0dRQ)

[2] [https://scifi.stackexchange.com/questions/15141/how-did-
the-...](https://scifi.stackexchange.com/questions/15141/how-did-the-computer-
virus-get-uploaded-into-the-mothership-in-independence-day#15143)

[3] [http://mccarrolllab.com/dropseq/](http://mccarrolllab.com/dropseq/),
[https://directorsblog.nih.gov/2015/06/02/single-cell-
analysi...](https://directorsblog.nih.gov/2015/06/02/single-cell-analysis-
powerful-drops-in-the-bucket/#more-4718)

[4] [http://thenode.biologists.com/fishing-
fish-2/resources/](http://thenode.biologists.com/fishing-fish-2/resources/),
[https://www.youtube.com/watch?v=-jIZ3bH-
rAE](https://www.youtube.com/watch?v=-jIZ3bH-rAE)

[5] [https://en.wikipedia.org/wiki/Single-
cell_transcriptomics](https://en.wikipedia.org/wiki/Single-
cell_transcriptomics)

[6] [http://ki.se/en/news/special-nerve-cells-cause-goose-
bumps-a...](http://ki.se/en/news/special-nerve-cells-cause-goose-bumps-and-
nipple-erection),
[http://www.nature.com/neuro/journal/v19/n10/full/nn.4376.htm...](http://www.nature.com/neuro/journal/v19/n10/full/nn.4376.html)

[7]
[https://www.biorxiv.org/content/early/2017/10/19/206052](https://www.biorxiv.org/content/early/2017/10/19/206052),
[http://velocyto.org/](http://velocyto.org/)

~~~
vanderZwan
So believe it or not, a lot of genomic research uses text-based formats. When
I first heard that my jaw dropped to the floor - we're talking about an
alphabet with _four letters_ , and you're using ASCII? Really?
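
To show what I mean, here is a minimal sketch in Python of packing a sequence
into two bits per base. The bit assignment and the function names are my own
assumptions for this example, not any existing file format, and it ignores
things real formats have to deal with, such as a symbol for unknown bases.

```python
# Arbitrary 2-bit code per base, chosen for this sketch only.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 2 bits per base, 4 bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Reverse of pack(); length is needed because the last byte may be padded."""
    return "".join(
        BITS_TO_BASE[(data[i // 4] >> (2 * (i % 4))) & 0b11]
        for i in range(length)
    )
```

With this scheme `pack("GATTACA")` takes 2 bytes where ASCII takes 7, and
`unpack` gives the original string back as long as you stored its length.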

Similarly, for gene expression, CSV files were pretty common, probably because
it's the most "neutral" format for tabular data. Obviously, this doesn't scale
if you are going to measure so much data that your table consists of tens of
thousands of genes by hundreds of thousands of cells. So my boss, professor
Sten Linnarsson decided that his group needed a more efficient file format.
And ideally, it would become adopted by other labs, so that everyone could
easily exchange information.
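
Part of why a binary layout scales better is random access: with fixed-width
values you can compute where a row starts and read only that row, whereas with
CSV you have to parse everything before it. The sketch below illustrates just
that idea in Python. The 16-bit row-major layout and the function names are
assumptions for this example; it is not how Loom or HDF5 actually store data.

```python
import io
import struct


def write_matrix(f, rows):
    """Write rows of small non-negative integers as little-endian uint16."""
    for row in rows:
        f.write(struct.pack(f"<{len(row)}H", *row))


def read_row(f, row_index, n_cols):
    """Seek straight to one row and read it, without touching the others."""
    itemsize = struct.calcsize("<H")  # 2 bytes per value
    f.seek(row_index * n_cols * itemsize)
    raw = f.read(n_cols * itemsize)
    return list(struct.unpack(f"<{n_cols}H", raw))


# Toy genes-by-cells matrix held in memory instead of a real file.
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
buf = io.BytesIO()
write_matrix(buf, matrix)
second_gene = read_row(buf, 1, 3)
```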

The result is the .loom file format[0]. It's open source, BSD licensed, and it
comes with a SciPy support library out of the box. It is HDF5-based[1], which
has the benefit of being an old, battle-tested system, and platform-
independent.

Another problem to address is that a lot of gene data is out there, but when
it comes to asking _simple_ questions you need to do a lot of work.
Researchers have to "anticipate" what kind of questions people might ask about
their data and put in the effort of creating a website for it. For example,
for one of the group's recent papers, they created a simple website that lets
you explore the gene expression of a single gene[2]. It turned out to be very
popular.

So he wanted a more generic viewer for the Loom format, that lets you quickly
explore the metadata and gene data and ask simple questions (for complicated
stuff you will always need to download the whole dataset). That is what I have
been building.

The Loom Viewer[0] is an SPA that lets you explore Loom files without having to
download the whole data set. If, for example, you are interested in a few
dozen genes, but the data set is a hundred thousand cells and tens of
thousands of genes in size, it is both wasteful and time-consuming to have to
download all of that. You can run it off-line with Loom files you created
yourself, and you can use it to serve loom files to others.

The app is still very rough around the edges, but feel free to have a look.

It's a really fun project, my first dive into web-dev, which was a bit
frustrating at times because I basically had to teach myself everything
without any guidance or help. But I learned all kinds of cool things: the app
makes use of the canvas and typed arrays for relatively fast rendering[3] and
minimal bandwidth overhead (most comparable viewers download image data from
the server), has off-line storage for fast gene retrieval, stores the state
regarding the "view settings" in the URL so you can inherently share links,
etc. The great thing about working for a professor is that he lets you indulge
in building cool stuff for them.
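
For the curious, the "view settings in the URL" idea boils down to something
like the sketch below. It is written in Python to keep it short, and it is not
the viewer's actual code: the base URL, the `state` parameter name and the
settings keys are all hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://example.org/viewer"  # hypothetical address


def state_to_url(state: dict) -> str:
    """Serialise the view settings as compact JSON in a query parameter."""
    payload = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return f"{BASE_URL}?{urlencode({'state': payload})}"


def url_to_state(url: str) -> dict:
    """Recover the view settings from a shared link."""
    query = parse_qs(urlsplit(url).query)
    return json.loads(query["state"][0])
```

Because the whole state lives in the link, sharing a view is just copying the
address bar; the trade-off is that URLs get long once the state grows.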

[0] [https://github.com/linnarsson-lab/loompy](https://github.com/linnarsson-
lab/loompy)

[1]
[https://en.wikipedia.org/wiki/Hierarchical_Data_Format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format)

[2] [http://linnarssonlab.org/cortex/](http://linnarssonlab.org/cortex/)

[2] [https://github.com/linnarsson-lab/loom-
viewer](https://github.com/linnarsson-lab/loom-viewer)

[3] In case you're wondering: _" why not WebGL?"_ Well, because my target
audience includes old professors who _print out_ websites. I actually need to
render 100 separate canvases; faking it with one WebGL overlay like regl does
here: [http://idyll-lang.org/idyll-regl-component/](http://idyll-
lang.org/idyll-regl-component/) would not survive the print command.

------
nonbel
While I like this project, it seems too ambitious for the current state of
affairs.

Start with counting how many cells of each type are present for various
tissues. Do this using cadavers of various ages, etc. Pretty much all we have
right now for such estimates are totally back of the napkin.

~~~
vanderZwan
> Start with counting how many cells of each type are present for various
> tissues.

I'm confused: what do you think this project is trying to achieve if not
exactly that?

~~~
nonbel
They are planning on sequencing, etc. That is all great but honestly I think
it is jumping the gun.

I am saying to simply count the cells, so we have the most basic of
information. Work on this only really began in 2013...[1] From (quickly)
reading their whitepaper it isn't clear to me whether they will even get the
count data.

[1]
[https://www.ncbi.nlm.nih.gov/pubmed/23829164](https://www.ncbi.nlm.nih.gov/pubmed/23829164)

~~~
vanderZwan
I'm sorry, maybe I'm missing something but what you're saying is not making
any sense to me. Furthermore it gives me the impression you completely
misunderstand the biological science behind this project.

I thought you meant figure out the number of different types of cells, but
you're actually saying _counting the numbers of cells of each type?_

You're putting the cart before the horse, since we haven't identified each cell
type yet. That's what this project is doing!

What does counting _cells_ in tissues even mean here? _Which_ cells? If you
can't even tell the cell types apart, _what exactly would you be counting?_ Do
you think we already know every kind of cell type and what it does? Because we
don't.

And even ignoring that: say that we figure out the liver has N hundred million
cells. On average, because tissue size _wildly_ varies per person. But hey,
sure, from a Pure Science perspective we could figure that out. What does that
information then tell us? Is it information in any way meaningful?

How does counting the number of cells give us a better understanding of what
that tissue does? Would you suggest we try to make sense of how countries are
organised and what kind of culture they contain by grouping them by population
size? Apparently Ghana and Nepal are identical according to this logic, as are
Cameroon and Taiwan, and Niger and Sri Lanka[0].

By comparison, determining cell types via gene expression is like measuring
education levels and which percentage of the population is in which
profession. I don't know about you, but when it comes to making sense of how
the body works, my money is on that.

[0]
[https://en.wikipedia.org/wiki/List_of_countries_and_dependen...](https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population)

~~~
nonbel
"Type" of cell is a human construct. There is no actual thing as a type of
cell, so the list is always complete. People could decide to have more or less
types depending on what they are working on, but there is no sense in trying
to delineate all the "types".

Currently we already have a classification system for cell types that is used
to distinguish between different types of cancer. They should use exactly that
same system.

Of course the numbers will vary by individual. What we want to get is basic
upper/lower bounds. It would probably be best to see how well # of cells
correlates with weight, skin area, etc and split the data into groups based on
whatever easily measured physiological parameter would work best.

Once you have number of cells at various ages, then _any_ biological theory
that requires the collective activity of many cells will need to be consistent
with these values. This can produce a major constraint on much theorizing,
especially in areas like cancer.

Say you have a theory that cancer is caused by multiple mutations (eg, n = 7)
accumulating in a cell lineage. You have other data about mutation rates per
basepair per division (eg, p = 1e-8). If this theory is correct you would need
a certain number of divisions to explain the curve of cancer incidence by age
for that tissue. That number of divisions can be estimated from the difference
in number of cells at various ages (or at least we can get some bounds on it
by ignoring cell death, etc) and compared to the value required by the cancer
theory.
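
A rough sketch of the kind of bound I mean, in Python. The cell counts are
placeholders, not measurements, and the functions assume no cell death and a
constant mutation rate, which is exactly the simplification described above.

```python
import math


def total_divisions(cells_before: int, cells_after: int) -> int:
    """Each division adds one cell, so without cell death the number of
    division events equals the increase in cell count."""
    return cells_after - cells_before


def min_generations(cells_before: int, cells_after: int) -> float:
    """Lower bound on the divisions the deepest lineage has gone through."""
    return math.log2(cells_after / cells_before)


def p_site_mutated(p_per_division: float, divisions: float) -> float:
    """Chance that one given basepair has mutated after that many divisions."""
    return 1.0 - (1.0 - p_per_division) ** divisions


# Placeholder counts for the same tissue at two ages.
events = total_divisions(1_000_000, 8_000_000)
depth = min_generations(1_000_000, 8_000_000)
p_hit = p_site_mutated(1e-8, depth)
```

If the theory needs far more divisions than counts like these allow, that is a
problem for the theory; that is the sort of constraint I am after.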

~~~
vanderZwan
You sound like a physicist who thinks he can explain the world in terms of
spherical cows, asserting a shit-ton of stuff without clear back-up as to why
what you state is true, acting as if all kinds of nuanced issues are simple
and/or solved problems, and assuming authority over a field you do not seem to
be part of.

For example:

> _Currently we already have a classification system for cell types that is
> used to distinguish between different types of cancer. They should use
> exactly that same system._

Why? You don't give any reason why this classification system is any better
than any other. Why would a system of classification of cancers be useful for
classifying all other cell types? Actually, asking this question is already
giving this statement too much credit: which of the various classification
systems are you even referring to[0]?

[0] [https://www.news-medical.net/health/Cancer-
Classification.as...](https://www.news-medical.net/health/Cancer-
Classification.aspx)

~~~
nonbel
I don't follow your hang up about classification systems. Different systems
are good for different purposes. There is no reason one is better than any
other, in general.

You are asking for something that makes no sense. EG, which is better: a
winter coat or a hoody? It depends on the situation. You can use either of
them, both of them together, or something else entirely.

For the purpose of my example, use whatever system SEER uses since that is the
biggest database of cancer data.

>"You sound like a physicist who thinks he can explain the world in terms of
spherical cows, asserting a shit-ton of stuff without clear back-up as to why
what you state is true, acting as if all kinds of nuanced issues are simple
and/or solved problems, and assuming authority over a field you do not seem to
be part of."

I am saying literally count cells, how many are there that look like x, y, and
z? Classify x, y, z however it is already being done in the clinic, if there
is more than one method... then use more than one method. Each method probably
has its advantages and drawbacks.

What assumptions are you talking about? I am making pretty much none. Sit
there, count the cells to the best of your ability, that's it.

In fact the currently proposed project is going to be making all sorts of
assumptions behind the various assays. Each of these makes the data more
questionable. That is why I want basic stuff like # of cells that is least
likely to be messed up.

------
untilHellbanned
This will be useful, but similar to the hype machine behind React or Node.js,
molecular biology is jerked around by new technologies that confer unclear
value over existing approaches.

In this case, it’s single cell rna seq. I’d argue we never got very far with
bulk measurement RNA analysis because it’s not a functional technique, rather
than us not having single cell resolution.

Look at the Nobel prizes. So many of them have simple genetics or biochemistry
at their core. It’s because those experiments were functional.

And before you tell me we haven’t yet had enough time for gene expression
studies to deliver a Nobel prize, I’d say well it’s been almost 25 years.
Yamanaka 4 factors is less than 10. CRISPR, also less than 10, should come soon
too.

