
Building an AI to predict human age from a blood sample - ruborcalor
https://colekillian.com/post/methylation-age-prediction/
======
siscia
When I see work like this, it gives me the impression that the ML hype is way too
real.

The goodness of the result of a machine learning model like this one should be
compared with the goodness of a simple "standard" model like linear
regression.

Yeah, it is kinda cool that we can use 10 lines of TF to spin up a huge
computation, but I'd guess that a simple linear regression would have provided
results at least similar to those of the neural network.
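
For concreteness, here is the kind of baseline I mean, sketched with sklearn on
synthetic data (the real methylation matrix isn't reproduced here, so the
feature count, coefficients, and noise level are all made up):

```python
# Baseline check: fit a plain linear regression on synthetic data shaped
# like the article's setup (~700 samples, 25 selected features).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(700, 25))                    # stand-in for the 25 features
age = X @ rng.normal(size=25) * 3 + 50 + rng.normal(scale=2, size=700)

# same 9:1 split as the article
X_tr, X_te, y_tr, y_te = train_test_split(X, age, test_size=0.1, random_state=0)
baseline = LinearRegression().fit(X_tr, y_tr)
print("baseline MAE:", mean_absolute_error(y_te, baseline.predict(X_te)))
```

If the neural net can't beat those few lines by a clear margin, the extra
machinery isn't buying anything.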

~~~
fock
that's exactly the thing I'm seeing in my field (computational materials
science). Basically a simple regression model (with very simple features)
gets you 90% of the way there, yet people compete on publishing ever-better
results on ONE shitty benchmark dataset. The most-cited people are using KRR,
where each fit uses 2 TB of RAM and "days" of CPU time (features of length
O(1000) and 100,000 samples), while the sample data is probably a 30-second
calculation (and still only a rough estimate). Sometimes it makes you want to
question science, but hey, writing proposals with "ML" in them at least gives
you a chance at that grant...

~~~
KaoruAoiShiho
I haven't thought about it as much as you probably have, but OK: you get 90% of
the way there, and now what? How do you get that last 10%? It would be a huge
amount of work, right?

~~~
hef19898
Isn't that usually the case? That the first 90% are as difficult to achieve as
the next 5%, which are again as hard as the next 3%?

~~~
KaoruAoiShiho
Basically the argument is that ML would be able to get there easier.

------
chewxy
>Back to the computer science: 470,000+ features sounds nice at first, but is
a recipe for overfitting when we only have 700 samples at our disposal.

Proceeds to use (1024^2 * 2 + 1024) parameters in the neural network.
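
For the record, that figure is just the weight count of two dense hidden layers
of 1024 units each (plus one layer's biases; the input and output layers
contribute comparatively little and are ignored here):

```python
# Reproducing the parameter count quoted above, assuming an architecture
# with two 1024-unit dense hidden layers.
hidden = 1024
weights = hidden * hidden * 2   # two 1024x1024 weight matrices
biases = hidden                 # bias vector of one hidden layer
print(weights + biases)         # 2098176 parameters vs. ~700 samples
```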

~~~
deepnotderp
I know this is a joke, but the theory of generalization in NNs is rapidly
advancing and it's not quite that simplistic:
[https://arxiv.org/abs/2003.02139](https://arxiv.org/abs/2003.02139)

~~~
chewxy
Ya. And the choice of optimizer (in this case Adam) also imposes some
regularization scheme on the network.

I just thought I'd highlight a bit of funniness.

~~~
profunctor
How does Adam provide regularisation? I’d never heard of this before and I
don’t recall it from when I read the paper.

------
Laurentvw
For anyone interested, there's a tool
[http://www.aging.ai](http://www.aging.ai) which does exactly that, using
deep-learning algorithms on, and I quote, "hundreds of thousands anonymized
human blood tests".

I've used it myself for fun after doing a blood test. It's a free alternative
to InsideTracker's InnerAge product.

~~~
ruborcalor
Wow very cool tool! I didn't know about this somehow. Thanks for sharing. Cool
that you've tried it on yourself.

------
minimaxir
With respect to having more features than samples, see also the Curse of
Dimensionality:
[https://en.wikipedia.org/wiki/Curse_of_dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)

------
gmtx725
why would you use a neural net on a dataset with only 700 samples, smh

~~~
wyxuan
The relationship in the dataset seems pretty clean and clear cut, so you don't
really need as large a dataset.

~~~
0x1221
If it's clean and clear cut you probably don't need deep learning either.

------
all_blue_chucks
It occurs to me that antibodies for diseases would make an interesting
approach to age estimation. In the 1918 flu, older people were spared; this is
presumed to be because they had immunity from an exposure in their own youth.

~~~
echelon
Interesting, but strewn with potential challenges.

Over time, cell populations with BCR/TCR that recognize and bind such antigens
will cease proliferation. Moreover, some cell populations will be localized to
certain tissues and not in circulation.

~~~
wyxuan
Yeah, and it would be impossible to determine the age with a resolution better
than a decade (at best).

------
n_2
Good to see other people interested in this!

Our startup (Chronomics) has built the most accurate epigenetic clock from
saliva (no needles!), which looks at 20 million positions (or features):
[https://www.chronomics.com/science](https://www.chronomics.com/science)

It's a really interesting area, and from DNA methylation we are starting to be
able to define many more novel indicators of actionable health risks, such as
smoke exposure, alcohol consumption, and metabolic status.

~~~
ruborcalor
Wow very cool startup; wish you guys the best of luck!

------
Znafon
When you post an article, please don't publish the code somewhere that requires
creating an account to read it. In this case I cannot check the full code
because it is hosted at
[https://colab.research.google.com](https://colab.research.google.com); a
tarball attached to the article or a publicly accessible host like gitlab.com
or github.com would have been fine.

~~~
ruborcalor
Definitely a good tip. I'll try not to make that mistake again.

You can now find the jupyter notebook code here:
[https://github.com/Ruborcalor/Age-Prediction-Via-Blood-Sampl...](https://github.com/Ruborcalor/Age-Prediction-Via-Blood-Sample-Methylation-Profiles)

------
cdrake
Interesting write-up. I'd be interested to see how it performs with k-fold
validation as well as shuffling. I'm kind of worried it's learning the order of
the samples.
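
Something like this is what I have in mind, with an sklearn regressor standing
in for the notebook's Keras model and synthetic data in place of the real
features:

```python
# Sketch of a k-fold check with shuffling, on stand-in data shaped like
# the article's (~700 samples, 25 features).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(700, 25))
y = X @ rng.normal(size=25) + rng.normal(scale=2, size=700)

# shuffle=True guards against any ordering baked into the dataset
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv,
                         scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)
```

If the per-fold errors vary wildly, that's a sign the single 9:1 split was
flattering.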

~~~
ruborcalor
Thanks, I really appreciate it.

I'll try to get back to you with the performance under k-fold validation and
shuffling.

I don't think it can be learning the order of the samples, because the train
and test data sets are separated very early on. If it were learning the order
of the samples in the training set, it would perform very poorly on the test
set.

~~~
cdrake
This had been bugging me in the back of my head all day... it turns out shuffle
is enabled by default, both in sklearn and in tf.keras (also original Keras).

On a separate note, I think there may be a source file missing from your
notebook. I kept getting an error when trying to load
"GSE87571_series_matrix.csv". Might just be me.

[sklearn ref](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split)

[tf.keras](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit)
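
To make the point concrete, `train_test_split` only preserves sample order if
you opt out of shuffling:

```python
# shuffle=True is the sklearn default; disabling it splits sequentially.
from sklearn.model_selection import train_test_split

X = list(range(10))
train, test = train_test_split(X, test_size=0.2, shuffle=False)
print(train, test)  # first 8 items train, last 2 test, order preserved
```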

------
cletus
So I'm no ML guru or anything, but what I learned was that if you have m
features on n samples, you want n > m to prevent over-fitting, no?

Also, with so few samples, how do you do your hyperparameter tuning and
validation?

I mean you could eliminate certain features in isolation but that doesn't
capture dependent features. And how would you do dimensionality reduction?

~~~
ruborcalor
Yes, I agree that you want the number of samples to be greater than the number
of features. This is why the number of features was reduced from over 400,000
to 25; after this reduction, the number of features is less than the number of
samples (~700).
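
A sketch of that reduction step (on stand-in data; the real matrix has
~470,000 columns, shrunk here for the example): rank each feature by the
absolute value of its correlation with age on the training set only, and keep
the top 25.

```python
# Select the 25 features most correlated with age, using training data only.
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(630, 400))   # stand-in for the full feature matrix
y_train = rng.normal(size=630)          # stand-in ages

# correlation of each column with the target
corr = np.array([np.corrcoef(X_train[:, j], y_train)[0, 1]
                 for j in range(X_train.shape[1])])
top25 = np.argsort(-np.abs(corr))[:25]
X_train_reduced = X_train[:, top25]
print(X_train_reduced.shape)
```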

Honestly, I didn't prioritize hyperparameter tuning enough. I pretty much went
with one of the first models I identified.

Could you elaborate on the idea of not capturing dependent features, please?

------
ChaseT
For reference, the generally accepted standard for determining age from blood
is the Horvath clock [1]. It seems to be accurate and only uses a penalized
regression. Keep in mind this represents what your age is in reference to a
"healthy" person. For example, a 50-year-old who smokes may have the
equivalent practical age of a 60-year-old who doesn't. The Horvath clock is
useful for evaluating lifestyle changes and your overall healthspan.

If people want to learn more about how DNA methylation relates to aging, I
recommend reading Lifespan by David Sinclair.

[1] [https://www.semanticscholar.org/paper/DNA-methylation-aging-...](https://www.semanticscholar.org/paper/DNA-methylation-aging-clocks%3A-challenges-and-Bell-Lowe/cd94f6ce0c1bc108aa164fb261e2c0b82388b888)
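
For a flavor of the penalized-regression approach, here is a minimal
elastic-net sketch on synthetic methylation-like data (the real clock's CpG
sites and fitted coefficients are not reproduced here):

```python
# Horvath-style idea in miniature: a penalized linear model that selects
# a sparse set of CpG sites predictive of age.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.uniform(size=(700, 500))       # stand-in beta values in [0, 1]
w = np.zeros(500)
w[:30] = rng.normal(size=30)           # only 30 sites carry signal
age = 40 + X @ w * 10 + rng.normal(scale=2, size=700)

clock = ElasticNetCV(l1_ratio=0.5, n_alphas=20, cv=5).fit(X, age)
print("CpG sites with nonzero weight:", np.count_nonzero(clock.coef_))
```

The L1 part of the penalty zeroes out uninformative sites, which is why the
published clocks end up using only a few hundred positions.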

~~~
ruborcalor
Yes, the Horvath clock seems to be the standard.

Thanks for sharing the book, I'll have to check it out!

------
manthideaal
The idea of selecting the 25 features based on maximum correlation seems weak,
because it should introduce a lot of collinearity. Chapter 6 of the ISLR book
covers many methods for working in high dimensions, that is, when the number
of features is bigger than the number of samples: for example principal
components regression, partial least squares, the lasso, ridge regression, and
forward stepwise selection. All of those methods can be used with 10 or so
lines of R using the packages and examples described in the chapter 6 lab of
the ISLR book.
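
For anyone who prefers Python, the same kinds of methods are just as short with
sklearn; for example, principal components regression as a pipeline on
synthetic p > n data:

```python
# Principal components regression: project onto a few components, then
# fit ordinary least squares on those components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(700, 2000))            # more features than samples
X[:, :5] *= 10                              # make a few features dominate
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=700)

pcr = make_pipeline(PCA(n_components=25), LinearRegression()).fit(X, y)
print("train R^2:", pcr.score(X, y))
```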

------
starchild_3001
> Therefore I first split the data into training and testing sets at a ratio
> of 9:1, and selected the 25 most correlated features in the training set.
> Each of these features had a correlation with age between 0.83 and 0.94.

> The data was then split into training and testing sets at a ratio of 9:1,
> and fed into a sequential neural network.

What?? I thought it was split already.

(How training and test sets were obtained sounds fairly confusing. Did the
author make sure there's no "data snooping" ?)

~~~
ruborcalor
I apologize for the confusion; I've since removed this typo.

The data is only split once, _before_ using a correlation test to select the
features that the model would be trained on. As far as I can tell there is no
data snooping occurring, because the data is split into train and test sets
before any decisions are made.

------
imvetri
Is the underlying concept related to "by analyzing proteins in the blood, one
can estimate a person's biological age, as well as weight, height, and hip
circumference", as mentioned in this article?

[https://www.dailymail.co.uk/sciencetech/article-3349739/Woul...](https://www.dailymail.co.uk/sciencetech/article-3349739/Would-want-know-old-body-REALLY-Blood-test-claims-able-reveal-biological-age.html)

~~~
ruborcalor
The two concepts may well be related, but the approach used in this paper
doesn't make use of proteins in the blood. Rather, it uses DNA methylation
extracted from the white blood cells in the blood.

Interesting article, thanks for sharing.

------
webo
For some context, the author is an undergraduate student.

~~~
ruborcalor
Haha, yes, good point. Take everything with a grain of salt!

~~~
webo
I don't know if you're the OP, but I didn't mean it in a negative way. This is
extremely well written and researched, better than most graduate students'
writing, let alone non-academics'.

------
echelon
Far-reaching prediction: they're going to do facial prediction from blood
samples as well. Law enforcement really wants to generate sketches from
unknown DNA found at crime scenes.

That is, of course, in addition to the all-encompassing family trees we're
providing them via 23andMe.

~~~
cjbprime
In what sense is this a prediction? Estimating faces from DNA is a challenge
researchers have already been competing on and publishing papers about for
years:
[https://www.pnas.org/content/early/2017/08/29/1711125114](https://www.pnas.org/content/early/2017/08/29/1711125114)

------
peter303
What if I transfuse blood from healthy young subjects? That was a claim from
some startups, a wacky VC or two, and even a joke on the HBO show Silicon
Valley. The rumor mill says this is happening at a low level.

~~~
ruborcalor
Interesting point; the idea of transfusing blood from healthy subjects had
never occurred to me.

I'm not sure I understand; what was the claim from the startups, and what does
the rumor mill say is happening at a low level?

------
unwoundmouse
Great work! I love how well presented all of the information is.

I'd be really interested to see how well a baseline linear model using those
features would perform - it seems like it could do pretty well.

~~~
jonathankoren
A linear model should always be compared to these DNNs.

There was a paper last year or so that compared correctly tuned linear models
to various deep belief net papers and found that the performance "gains"
suddenly evaporated or were not nearly as great as originally published.

If I can track down that paper, I'll post it.

------
martopix
Looks like a beginner-level sklearn task. Linear regression would probably be
OK; if not, there's random forest or a two-layer perceptron. No need for a deep
network.
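
For example, a quick pass over those models with sklearn, using stand-in data
in place of the article's selected methylation features:

```python
# Compare linear regression, random forest, and a small MLP by
# cross-validated mean absolute error on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(700, 25))
y = X @ rng.normal(size=25) + rng.normal(scale=2, size=700)

models = [LinearRegression(),
          RandomForestRegressor(n_estimators=100, random_state=0),
          MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                       random_state=0)]
for model in models:
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{type(model).__name__}: MAE {mae:.2f}")
```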

~~~
ruborcalor
I'll try out these models and get back to you with their effectiveness.

------
modelzero
Cool, but can we not call this AI?

~~~
ruborcalor
What would you prefer to call it? Machine learning?

------
chengangcs
Methylation information obtained by sequencing the DNA in the blood is good
enough to predict age.

~~~
keymone
Is it faster/cheaper though?

