
AI in drug discovery is overhyped: examples from AstraZeneca, Harvard, Stanford - mostafab
https://medium.com/the-ai-lab/artificial-intelligence-in-drug-discovery-is-overhyped-examples-from-astrazeneca-harvard-315d69a7f863
======
anon1253
1) As someone who does NLP and AI for Evidence-Based Medicine [1], I can only
agree. I build natural language processing tools, together with the team, for
EBM. I have /no/ idea what to do next. Please. Help me.

2) Rant: here's the thing, we have SVMs, LSTMs, BiLSTM-CRFs, knowledge graphs,
28M articles, deployment infrastructure, tests, you know all the cool stuff. We
can't sell it. It's PubMed this, PubMed that. There's no breaking that spell.
So instead of building a better search engine, we focused on better analytics.
Is it cool? Sure. Is it as good as all the KOL (key opinion leader) stuff from LexisNexis? Nope.
Medicine is /hard/; if you think your app is hard, medicine is /harder/. Am I
trying to cover my ass? Sure. But seriously, deal with everything being
non-standard and proprietary for long enough and you'll feel the same way. 80% of
my job is saying "no" to people who aren't going to pay in the first place
(Sanofi, Novartis, etc). They don't care. I'm just going to ride this out,
throw some blockchain in somewhere where it makes absolutely 0% sense, call it
a day. Seriously, NLP/AI for medicine: it's crap 100%. And, I'm a "senior
developer" with publications behind my ass on this. I don't know what to do
instead of riding the hype train. Seriously: anyone out there that knows what
to do with 100M+ medical publications in a 100M+ knowledge graph, be my guest:
I'd hire you in a heartbeat. But it's hard and there is nothing to be gained
AFAIK.

[1]: [http://growthevidence.com/joel-kuiper/](http://growthevidence.com/joel-kuiper/)

~~~
j7ake
Imagine you were given raw data from the Large Hadron Collider and you were
asked to find interesting things in the data. What would you do?

Any serious scientist would see the folly of trying to run only machine
learning algorithms to find the Higgs boson. What you would need (in addition
to ML algorithms for processing) is a good theory of what you're looking at.
This also means that just applying computer science methods to the problem isn't
going to work; you need to inject theory from physics.

For drug discovery this means you need to actually do biology at some point in
order to make progress. ML strategies involve a tight loop of trying different
models, implementing them, and checking whether each does better than the
others. What's needed, then, are good experiments to test your models and good
models to make novel predictions.

There is no getting away from wet lab experiments in biology if you're going
to make any significant advances. The theory just isn't there to do purely
computational work.

~~~
anon1253
Exactly, and that uncovers the folly in this: what are you trying to do? Often
enough it's very specific, and requires actually going out into the world.
Sometimes the literature is already there. But nearly always it requires some
form of analytical thought. "I want X because I need to prove Y". There is no
AI magic sauce around that. It means that the people interacting with the
"user friendly" interface need to know, and need to know how to get there
somehow. That's rarely the case. And from a product perspective it makes
things very hard: you get bombarded with very specific requests that require
time and effort to build UIs and models for, that at the end of the day
only solve that specific question. Instead, it would be better to equip people
with the tools needed to solve the meta-question so to speak, but that meta
question is much much harder (at least, I've been trying but no success yet).
As a funny anecdote: my medical knowledge has exploded. But I see that as a
symptom of the broader problem: I know how to do /some/ things because I've
gained that domain knowledge, but I can't teach the system yet to solve the
"general" problem because a) I don't know what it is, b) there is no data to
get there, and c) what even is the UI for that?

~~~
ForrestN
Is this a fair way to express your problem?

You're being asked to use AI to replace an entire process all at once, rather
than to assist with or replace a step in that process.

Instead of being asked "how do I solve this problem" it would be nice if you
were asked "can you replace this thing I have to do all the time?"

Maybe this doesn't apply to your work, but if I were in your position I'd talk
to the human that does the human version of the process and break down
everything they do to get from A to Z into steps. If you can start by just
replacing or augmenting a single step, you're adding value right away and
you've also built a part of your eventual A to Z machine, right? For example,
your search engine idea is a good direction. Can you automate the process of
finding relevant articles? That would help the humans but would also help
whatever AI you're working up to, right? Then maybe further in there's some
other process involving physics modeling or something that is clear enough in
scope to do programmatically.

I do much simpler work, but I find it's much easier to automate processes one
step at a time rather than all at once.

------
chris_va
As someone who has worked on AI for drug discovery, I would say that the title
is correct, but not for the reasons stated. There is also some annoying
speculation in this document that is completely incorrect and unsupported, so
I'd caution people reading it.

Anyway, the primary reason that AI for drug discovery is overhyped is that the
sort of problems AI is good at solving don't line up well with the unsolved
problems in the drug discovery pipeline.

This article, for example, focuses a lot on lead generation. Lead generation
is the easiest aspect of the problem to tackle using AI, and so most people
doing research start out trying to build a foundation in this space. However,
it doesn't actually represent the majority of the cost.

Drug makers typically spend about ~$800M on failed drugs for every ~$900M in
revenue. They aren't spending that $800M on leads, finding leads is fairly
easy. They are spending that money on drugs that fail in Phase 2 and Phase 3,
which is more about off-target side effects, bulk formulation and synthesis,
patient population differences, drug-drug interactions, etc.

It would be nice to have better leads, but there isn't a shortage of leads that
look good in vitro or even in vivo. It isn't until much later in the pipeline
that the costs really add up, and failures there are expensive. If we could
solve off-target side effects using AI, then we'd be in a whole different
ballgame. Having banged my head against it for a while, I think it is
possible, but will take a huge amount of investment.

The work this article talks about is more foundational, which is necessary but
should not really be taken as anything more.

~~~
aaavl2821
that is a great point.

for those not familiar with where the money really goes in R&D, and where the
biggest opportunities for improvement are, check out these charts [1] and [2]
from the seminal paper on calculating the cost of getting a drug approved [3].

the biggest areas to improve R&D productivity lie in 1) picking more validated
targets to reduce the Phase 2 and 3 failure rate and 2) reducing the cost of
lead optimization: the process of turning a compound that has the desired
effect on the target (a target being a molecule implicated in disease that you
want to affect with a drug) into a molecule with drug-like properties, i.e. one
that gets where you need it in the body, is safe, and can be manufactured and
delivered efficiently

i'm not an AI person, but could AI play a role here? there are plenty of
validated targets that are "undruggable"; could AI help find
as-yet-undiscovered molecules that could engage these targets? or could AI
somehow make the med chem / lead optimization process easier?

[1] [https://media.nature.com/m685/nature-assets/nrd/journal/v9/n3/images/nrd3078-f2.jpg](https://media.nature.com/m685/nature-assets/nrd/journal/v9/n3/images/nrd3078-f2.jpg)

[2] [https://media.nature.com/m685/nature-assets/nrd/journal/v9/n3/images/nrd3078-f3.jpg](https://media.nature.com/m685/nature-assets/nrd/journal/v9/n3/images/nrd3078-f3.jpg)

[3] [https://www.nature.com/articles/nrd3078](https://www.nature.com/articles/nrd3078)

~~~
randcraw
The primary missions in drug development are efficacy and safety. AI can help
answer clear, well-formed efficacy questions like, "Does this molecule fit a
chosen target molecule?" But it can't help with bigger efficacy questions like,
"Will hitting this target make a _sufficient_ difference in managing this disease?"
Or _any_ safety questions like, "Does the molecule also hit any other
molecules somewhere in the body (or population) that might screw up something
else"? And unfortunately it's safety failures that eat up 90% of drug
development costs (esp. in Phase III).

Until Wall St (or Sand Hill Rd) understands that domain-agnostic low-info
approaches like AI are incapable of answering complex questions that require
teams of PhDs steeped in decades of doing both chemistry and biology, the
notion of CADD will continue to miss the mark and waste megabucks.

------
maxander
The fundamental problem with AI (here meaning, neural-network based
applications) applied to biology is that _pattern recognition_ doesn't get you
very far. When designing a small molecule drug, for example, the goal is to
find a molecule that will slot into some protein (or other biomolecule) of
interest in just the right way; whether it does so is dependent on shape in
complex and impossible-to-predict ways. As an analogy, take the task of making
a key (drug) to open the front door of an apartment across town (disease-
relevant target); your "training data" (known drug-y molecules) are the keys
for a bunch of other apartments sampled mostly randomly from around town.
Obviously, you can make any number of plausible key-oid objects that would
pass visual inspection as potential keys to the target, but even a godly-
intelligent post-Bostromian superAI couldn't _solve_ the problem in any
principled way. You just need information about the actual lock you want to
open, plain and simple.

There are many, many, _many_ other open problems in biology, of course,
ranging all along the conceptual gamut, so I can't say there's _nothing_ where
AI would be a killer app. And there's also the sort of last-ditch null
argument that, well, being smart helps humans do biology, so computers that
are smart must be useful _somehow_. But typical progress in this field is from
patiently acquiring new data. There's no point where you know "enough" of the
picture to infer the rest with any degree of certainty - the degrees of freedom
in the "design" of biological systems are simply too vast.

~~~
steve_musk
I’m not sure this is a great analogy. If we didn’t know anything about the
locks (target binding sites) I would agree with you, but there is no reason to
enforce that restriction. A pipeline that, given a target binding site, could
predict a number of potential drug structures with high accuracy would be extremely
useful. Going with your analogy, this would be like training a net using a
database of lock structures and their corresponding keys, and then having it
predict a key for a given lock. That seems fairly doable for current ML
techniques.

Obviously actual drug design is more complicated than this, and you have to
consider side effects. But I don’t think it’s as hopeless as your analogy.
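To make the lock-and-key pipeline concrete, here is a toy sketch in Python. Everything in it is invented for illustration (the pin heights, the hidden linear cutting rule, the random training data): the point is that when key teeth really are a simple function of lock structure, a model trained on (lock, key) pairs recovers the rule trivially, which is exactly why the interesting question is how far real binding sites are from this clean setting.

```python
import random

# Toy "learn keys from (lock, key) pairs": a lock is a vector of tumbler
# pin heights; the matching key's teeth follow a hidden linear rule.
random.seed(0)

def cut_key(pins):
    # Hidden ground-truth rule the "model" is supposed to discover.
    return [2 * p + 1 for p in pins]

train_locks = [[random.randint(1, 9) for _ in range(5)] for _ in range(50)]
train_keys = [cut_key(lock) for lock in train_locks]

# Fit teeth = a * pin + b by ordinary least squares over all positions.
xs = [p for lock in train_locks for p in lock]
ys = [t for key in train_keys for t in key]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def predict_key(pins):
    return [round(a * p + b) for p in pins]

# The model recovers the rule exactly because this toy world is linear;
# real binding-site -> ligand prediction is nothing like this clean.
predicted = predict_key([3, 1, 4, 1, 5])
```

The fitted slope and intercept come out as exactly 2 and 1, so a held-out lock is "solved" perfectly; the analogy's force is that nobody knows a comparably learnable mapping from protein structure to ligand.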

~~~
maxander
> Going with your analogy, this would be like training a net using a database
> of lock structures and their corresponding keys, and then having it predict
> a key for a given lock.

If we have that, then the problem is entirely trivial- just look at the
lengths of the tumblers, and give the key teeth that match. And so it is, with
the corresponding increase in complexity, in biology- given a complete and
confident structure of the target, we have (non-AI, reasonably reliable)
chemical modelling methods to see whether and how a given compound will bind;
iterate molecules until you find one that works and you've found your "key"
quite cheaply. So, there's another limitation to the practicality of AI- when
it's not irreducibly complex and intrinsically unknown, biology is usually
mechanistic and deterministic enough that conventional approaches can reason
about it quite effectively without resorting to anything so fancy as a neural
network.

Biology is almost always more complicated than a given description, so yes,
the real problem is more complicated. As you mention, the key we create must
not only fit the target lock, but it must also not fit any _other_ locks that
might cause problems. Perhaps AI methods could try to make a working key that
is un-key-like as possible, to make it unlikely to fit off-target locks;
potentially useful. But, absent knowledge of the various perverse
configurations that locks in the wider world can have, the confidence of the AI
method's inferences is very limited - roughly speaking, it's attempting to make
a classifier for a population that is mostly unknown. We might hope for a
severalfold increase in the number of successful drug candidates, in the best
case, but not a fundamental disruption of the pharmaceutical research process.

~~~
steve_musk
> given a complete and confident structure of the target, we have (non-AI,
> reasonably reliable) chemical modelling methods to see whether and how a
> given compound will bind; iterate molecules until you find one that works
> and you've found your "key" quite cheaply.

I am a researcher in the computational chemistry/biophysics field, and from
what I can tell this is not the case (would love to be corrected if I am
mistaken; my research is not directly related to drug design). There are
various computational ways to test whether a molecule will bind to a target
site, but they are either too inaccurate or too computationally expensive to
claim that this problem is "solved" by chemical modeling methods. Pharma
companies are still wasting money sending drug structures that have been
validated by computational methods to organic chemists to synthesize that end
up not binding as expected.

------
Afforess
AI can't help much in drug discovery (i.e. lead generation) because the
pharmaceutical industry is suffering from Eroom's law (
[https://en.wikipedia.org/wiki/Eroom%27s_law](https://en.wikipedia.org/wiki/Eroom%27s_law)
). The name is a play on Moore's law ("Moore" spelled backwards), but the
fundamental problem is that the pharma/medical industries do not understand
biology from first
principles, unlike hard physical sciences, such as computer hardware
engineering.

First principles understanding of computing hardware allows chip manufacturers
to create novel, new chips from scratch, as we fully understand the physics
behind electricity, how logic gates work and signals propagate through
physical mediums. Lock most college students in a room with resistors, diodes,
wire, etc, and reference material, and they could recreate basic circuitry and
very simple computers. You cannot do the same with biology - no student or
expert, given unlimited resources, equipment, and reference material, can
construct a new living cell from scratch (from proteins / molecules). This is not a
slander of those sciences and scientists, but there is simply a huge gap in
how well we understand the basic principles of life and how well we understand
the basic principles of computing.

Throwing more resources / computing into "drug discovery" is like trying to
build chips by wiring computer parts differently. Occasionally it might work
and produce a "useful" result, but it's fundamentally a broken approach.

~~~
efangs
YES. Exactly this.

It's fine to make approximations to avoid exponential scaling, but applying
function approximators essentially randomly won't get you anywhere. This is
then compounded by the fact that the functional framework you're starting from
is not a first-principles approach.

Until there is QMC for drug discovery, it will all be hype.

~~~
randcraw
_Mostly_ hype. Yes, automating drug discovery to any great extent is utterly
hopeless, and about as likely to impact the pharma business any time soon as
autonomous killer robots overrunning the battlefield -- Not In My Lifetime.

But AI definitely has a near term future in addressing well formed questions
like specific assays or searching for well-constrained targets, like ligand
matches. The trick is for the AI contributor TO LEARN SOMETHING ABOUT THE
DAMNED DOMAIN. Unless the chemist/biologist is intimately involved in the
task, the AI provider is shooting blind. But with many wise eyes on the ball,
even the hardest problems become a lot more assailable.

[I say this as someone who processes images and analyzes data within a big
pharma, and has seen several grand IT plans fail (like systems biology disease
modeling) and many small & specific scientist-assistance tasks succeed.]

------
frisco
Responding to his issues with Vijay Pande's work (I'm not affiliated), graph
convolutions really are better than char-RNN for this. There's a good
theoretical motivation for why, and people have spent a lot of time trying to
find better embeddings of chemical space. (Admittedly, still not great - I
wouldn't dispute the overall thesis that AI in drug discovery is still very
early.)

I did a project a year or so ago to reimplement one of Aspuru-Guzik's papers, a
variational autoencoder
([https://github.com/maxhodak/keras-molecules](https://github.com/maxhodak/keras-molecules)),
and when I did that I compared it to char-RNN and the VAE did get much more
interesting results. I also
saw results from other people around that time showing that using graph
convolutions on the front end instead of one-hot encoded SMILES strings was
even better.
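For readers unfamiliar with the "one-hot encoded SMILES strings" mentioned above, here is a minimal sketch of the encoding step. The three example strings and the padding scheme are my own for illustration; real pipelines (keras-molecules included) build the character vocabulary from the full dataset and pad to a fixed maximum length.

```python
# Hypothetical mini-dataset of SMILES strings.
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]

# Vocabulary: every character seen anywhere in the dataset.
chars = sorted({ch for s in smiles for ch in s})
char_to_idx = {ch: i for i, ch in enumerate(chars)}
max_len = max(len(s) for s in smiles)

def one_hot(s):
    """Encode one SMILES string as a (max_len x vocab) 0/1 matrix,
    padding short strings with all-zero rows."""
    mat = [[0] * len(chars) for _ in range(max_len)]
    for pos, ch in enumerate(s):
        mat[pos][char_to_idx[ch]] = 1
    return mat

encoded = one_hot("CCO")
```

A char-RNN or VAE then consumes these matrices row by row; the graph-convolution approach replaces this string-level representation with the molecular graph itself.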

Also, as for the OP's objection that GANs don't work because of a "perfect
discriminator": this is an obvious result that becomes apparent after spending
20 seconds with the problem. (E.g., this thread:
[https://github.com/maxhodak/keras-molecules/issues/55](https://github.com/maxhodak/keras-molecules/issues/55)) I
haven't read the Harvard paper referenced but I'd be absolutely shocked if
this was lost on them. There are definitely ways to work through it.

~~~
mostafab
I am not sure I fully understand your remark, but if you have a benchmark of
graph convolutions vs. char-CNN, it would be great to write up your results and
post them on arXiv. Pande will be interested ;) The problem is not with the
theoretical motivation, but with the empirical confirmation.

I also agree that there are many ways to work through the perfect
discriminator problem for ORGAN. But it remains to be done (afaik).

~~~
frisco
I misread char-CNN as char-RNN in the linked article. I don’t know of results
that compare graph convolutions to char-CNN. If a comparison were to be made,
what metrics should it focus on?

~~~
mostafab
See MoleculeNet ;)

------
jostmey
For the most part, I agree with the article's premise.

But they were pretty unfair to the Stanford lab (Vijay Pande). The Stanford
team is using graph convolutions to process molecular structures. The article
complains that the Stanford team is biased because they are not using
sequences to represent the molecular structures. But this actually makes a lot
of sense because molecular structures _are_ graphs. The article then goes on
to say that the Stanford team has an agenda to get everyone to use deepchem
instead of TensorFlow or PyTorch. But TensorFlow _is_ a dependency in
deepchem, and deepchem looks like a bunch of useful tools wrapped around
TensorFlow.

I'm not buying the article's arguments.

~~~
blueblob
Agreed. Just because you don't compare your model to another model doesn't
mean you're hiding something, and it's irresponsible to pretend that's the
case. Perhaps the author should have run char-CNN against the data instead of
making baseless accusations.

On the whole, a lot of the arguments apply to ML in general. There is far too
much focus on individual problems and the accuracy measure, without enough
focus on the interpretation of that accuracy, how well it will really
generalize, etc.

~~~
mostafab
In general, you are right. But in this particular case, char-CNN is the
standard used by many people, Stanford included. It is not sophisticated at
all.

Look at the big list of exotic models they tried:
[http://moleculenet.ai/models](http://moleculenet.ai/models)

For me, this omission is negligence. It is selective laziness.

The author (me) does not have 20 PhD students, postdocs and startuppers at his
disposal to do the job for Stanford. With my limited resources, it is more
cost-effective to do my job against Stanford ;)

~~~
dekhn
It sounds like you have an axe to grind against well-funded professors, not that
you actually have a valid argument.

~~~
mostafab
This professor gave an argument in his paper. I gave a refutation. If you
agree with him that char-CNN was a sophisticated model in early 2017, then you
are not well-informed about the situation in deep learning.

The MoleculeNet co-author gave another argument in the comments. My refutation
is that you can't claim to lack time after 8*20 man-months have passed.

------
lilleswing
DeepChem developer and MoleculeNet co-author here.

We are NOT trying to get vendor lock-in for users of DeepChem. If there are
complaints about lock-in when using our tools, please file a GitHub issue and
we can work on creating an open, easy-to-use API for everyone.

~~~
mostafab
From a user's viewpoint, DeepChem would greatly benefit from being a better
team player with lower-level (TensorFlow) or other (PyTorch) frameworks. The
pace of research (in NLP in particular) is too fast to make it realistic to
port everything into DeepChem without unreasonable delay.

Does it fit the Deepchem agenda? That's another question ;)

~~~
lilleswing
Thanks for the feedback (I love feedback). While that is a good high-level
goal, the devil is in the details of how to make it happen. Here are some
doable medium-term ideas which might help.

1) Tutorial and documentation on how to access the raw Tensorflow graph when
using DeepChem

2) Tutorial of combining raw Tensorflow with our existing chemistry specific
layers

3) Different documentation quick-start sections for ML practitioners and
application practitioners.

4) Better overall documentation of our Chemistry Specific layers.

Would these ideas have made a better first user experience for you?

~~~
mostafab
Yes, especially 2)

------
JacobiX
The article mentions the name of a new startup too many times to be convincing.

~~~
mostafab
Sorry for this inconvenience. Content without ads is rarely free.

------
aaavl2821
This comment is related to AI in healthcare services, not drug discovery
specifically, but it is illustrative. I recently went to a talk by a senior
exec at one of the biggest health systems in the country. They were evaluating
AI tools to predict which patients would die within 18 months so they could
design the most patient-friendly and cost-effective end-of-life plan. None of
the algorithms they evaluated outperformed simply asking a doctor which
patients she thought would die in the next year.

------
YeGoblynQueenne
Of course, "these days" (since ~2014 by my reckoning) AI is synonymous with
Deep Learning (in the lay press anyway), but there are actually other
techniques still being researched, specifically theory- and knowledge-driven
techniques from the field of symbolic machine learning.

Such techniques have long been used in medicine and biology with great
success, very notably in the example of the Robot Scientist, a system that
automates the scientific process end-to-end, in a biology context.

This is from the abstract of the Nature paper, from January 2004 [1]:

    
    
      The question of whether it is possible to automate the scientific process is of both great theoretical
      interest and increasing practical importance because, in many scientific areas, data are being generated much
      faster than they can be effectively analysed. We describe a physically implemented robotic system that applies
      techniques from artificial intelligence to carry out cycles of scientific experimentation. The system
      automatically originates hypotheses to explain observations, devises experiments to test these hypotheses,
      physically runs the experiments using a laboratory robot, interprets the results to falsify hypotheses
      inconsistent with the data, and then repeats the cycle. Here we apply the system to the determination of gene
      function using deletion mutants of yeast (Saccharomyces cerevisiae) and auxotrophic growth experiments. We built
      and tested a detailed logical model (involving genes, proteins and metabolites) of the aromatic amino acid
      synthesis pathway. In biological experiments that automatically reconstruct parts of this model, we show that an
      intelligent experiment selection strategy is competitive with human performance and significantly outperforms,
      with a cost decrease of 3-fold and 100-fold (respectively), both cheapest and random-experiment selection.
    

There's more info and especially links in the Wikipedia article [2].

_____________

[1] Functional genomic hypothesis generation and experimentation by a robot
scientist,
[https://www.nature.com/articles/nature02236](https://www.nature.com/articles/nature02236)

[2]
[https://en.wikipedia.org/wiki/Robot_Scientist](https://en.wikipedia.org/wiki/Robot_Scientist)

------
jcoffland
AI in X is overhyped. For all X.

~~~
toddwprice
Was jumping on to say exactly that. We are most likely somewhere close to the
"Peak of Inflated Expectations" for the current iteration of AI. It's useful,
just not a magic bullet to solve all problems.

[https://en.wikipedia.org/wiki/Hype_cycle#/media/File:Gartner...](https://en.wikipedia.org/wiki/Hype_cycle#/media/File:Gartner_Hype_Cycle.svg)

------
isoprophlex
I'm not qualified to discuss drug discovery, but my eye was caught by the
author's statement that char-RNN on SMILES strings might outperform the use of
a Coulomb matrix. IMO this is really misguided, as a SMILES string does not
contain specific geometrical data (bond angles, dihedrals, etc.) of a
compound...

Learning the grammar of the SMILES representation is not the same as learning
to use interatomic distances.

~~~
dekhn
SMILES implicitly stores the geometry data. It's technically true that it
doesn't encode a unique 3D structure (for a rigid molecule), but you could
convert the implicit bonds to standard-length ones, embed the molecule in 3D
space, and minimize it (that's precisely what SMILES-to-3D-structure systems do).
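The implicit-bonds point above can be illustrated with a toy parser. This handles only a deliberately tiny, hypothetical subset of SMILES (single-letter C/N/O atoms, parenthesized branches, '=' double bonds, no rings or charges; real parsers such as RDKit's handle far more): what the string gives you directly is a connectivity graph, which a SMILES-to-3D system then embeds in space and energy-minimizes.

```python
def parse_smiles(s):
    """Parse a tiny SMILES subset into (atoms, bonds): atoms is a list of
    element symbols, bonds a list of (i, j, order) index pairs. Note that
    connectivity, not 3D coordinates, is what the string encodes."""
    atoms, bonds = [], []
    stack = []    # attachment points saved at each open branch '('
    prev = None   # index of the atom the next atom bonds to
    order = 1     # pending bond order (2 after '=')
    for ch in s:
        if ch in "CNO":
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, order))
            prev, order = idx, 1
        elif ch == "=":
            order = 2
        elif ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        else:
            raise ValueError(f"unsupported SMILES token: {ch}")
    return atoms, bonds

# Acetic acid, CC(=O)O: 4 heavy atoms, 3 bonds, one of them double.
atoms, bonds = parse_smiles("CC(=O)O")
```

From this graph, a 3D builder assigns standard bond lengths and angles and minimizes; the "geometry" was never written down, but enough of it is implied by the bonds to reconstruct a plausible conformer.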

------
ouid
calling any existing technology "AI" is overhyped

~~~
theophrastus
I don't know about " _any_ existing technology", but in any aspect of drug
discovery of which i'm aware, i'd agree. My job amounts to using statistics
and linear programming to winnow drug candidates out of databases of hundreds
of millions of small molecules (sometimes called "chemoinformatics"). And i've
lost count of how many times i've had the marketing department insist on
calling what i do "AI". (and let me assure you, as far as pushing back against
marketing is concerned, "all resistance is futile")

~~~
mkagenius
totally random thought: do you have pictures of molecules by any chance? they
could be a good source to train some CNN on, together with existing drugs.

Disclaimer: I have zero experience in drug discovery.

~~~
theophrastus
I don't know how that can help train whatever 'CNN' stands for (i'm guessing
not the news network, but some kind of Neural Network?). Yet there are vast
sources of various ("picture") representations of small molecules all over the
internet, from line drawings and 2d stick models[1] to rotatable space-filling
models[2]. As a very personal aside, neural networks are just a means of doing
curve fitting without necessarily learning anything.

[1]
[https://en.wikipedia.org/wiki/Morphine](https://en.wikipedia.org/wiki/Morphine)

[2] [https://www.cancerquest.org/patients/drug-reference/morphine](https://www.cancerquest.org/patients/drug-reference/morphine)

~~~
AmericanSeal
It stands for convolutional neural network, a NN architecture that works well
on image data. CNNs are used in object detection and other computer vision
tasks.
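As a minimal illustration of what one convolutional layer in a CNN computes, here is a hand-rolled 2D convolution in plain Python. The image and kernel are invented for this sketch; in a trained CNN the kernel weights are learned rather than fixed.

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution (really cross-correlation, as in most
    deep-learning libraries): slide the kernel over the image and take a
    weighted sum of the pixels under it at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# A vertical-edge kernel responds where intensity changes left-to-right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [[-1, 1],
          [-1, 1]]
edges = conv2d(image, kernel)  # nonzero only at the 0->1 boundary column
```

Stacking many such learned filters, interleaved with nonlinearities and pooling, is all a CNN's feature extractor is.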

------
brndnmtthws
Anyone who has actually built software using what the masses think is
'artificial intelligence' knows exactly how artificial and overhyped ML and AI
are. Neural networks are a far cry from the magic they've been made out to be,
and are nowhere close to as sophisticated as the brains of most animals.
Cherry picking results makes for good marketing material, but that's pretty
much where the usefulness ends.

We still don't even understand how exactly brains work.

~~~
tensor
Neural networks really are pretty magical for a lot of image-based problems
these days. The impact on text problems is so far decidedly not magical, though
there are starting to be some tantalizing results, in my opinion.

~~~
brndnmtthws
More gimmick than magic. At best they can do a crappy job of emulating
specific human behaviours on carefully curated training datasets.

~~~
infinite8s
What about AlphaGo Zero's approach of reinforcement learning based on
self-play? Arguably the game of go has really simple, explicit rules, and
evaluating the final end state is also easy, but it shows that with enough
compute NNs can identify important features without supervision or labeling.

~~~
brndnmtthws
What about what about

Given enough time, energy, and computing power, you could just precompute
every possible outcome and call it a neural network. You'll be praised as a
genius.

~~~
chillee
Ah yes, I forgot. DeepMind just secretly computed all 10^100 board states for
AlphaGo...

Neural nets have been very successful at a lot of things while also being
overhyped, but dismissing them all as marketing hype is throwing the baby out
with the bathwater.

------
lurr
"AI is overhyped" is probably an equally valid title.

