2) Rant: here's the thing, we have SVMs, LSTM, BiLSTM-CRF, Knowledge Graphs, 28M articles, deploy infrastructure, tests, you know, all the cool stuff. We can't sell it. It's PubMed this, PubMed that. There's no breaking that spell. So instead of building a better search engine, we focused on better analytics. Is it cool? Sure. Is it as good as all the KoL stuff from LexisNexis? Nope. Medicine is /hard/; if you think your app is hard, medicine is /harder/. Am I trying to cover my ass? Sure. But seriously, deal with everything being non-standard and proprietary for long enough and you'll feel the same way. 80% of my job is saying "no" to people who aren't going to pay in the first place (Sanofi, Novartis, etc). They don't care. I'm just going to ride this out, throw some blockchain in somewhere where it makes absolutely 0% sense, call it a day. Seriously, NLP/AI for medicine: it's crap 100%.
And, I'm a "senior developer" with publications behind my ass on this. I don't know what to do instead of riding the hype train. Seriously: anyone out there that knows what to do with 100M+ medical publications in a 100M+ knowledge graph, be my guest: I'd hire you in a heartbeat. But it's hard and there is nothing to be gained AFAIK.
Any serious scientist would see the folly of trying to run only machine learning algorithms to find the Higgs boson. What you would need (in addition to ML algorithms for processing) is a good theory of what you're looking at. This also means just applying computer science methods to the problem isn't going to work, you need to inject theory from physics.
For drug discovery this means you need to actually do biology at some point in order to make progress. ML strategies involve a tight loop of trying different models, implementing them, and checking whether each does better than the others. What's needed, then, is good experiments to test your models and good models to make novel predictions.
There is no getting away from wet lab experiments in biology if you're going to make any significant advances. The theory just isn't there to do purely computational work.
You're being asked to use AI to replace an entire process all at once, rather than to assist with or replace a step in that process.
Instead of being asked "how do I solve this problem" it would be nice if you were asked "can you replace this thing I have to do all the time?"
Maybe this doesn't apply to your work, but if I were in your position I'd talk to the human that does the human version of the process and break down everything they do to get from A to Z into steps. If you can start by just replacing or augmenting a single step, you're adding value right away and you've also built a part of your eventual A to Z machine, right? For example, your search engine idea is a good direction. Can you automate the process of finding relevant articles? That would help the humans but would also help whatever AI you're working up to, right? Then maybe further in there's some other process involving physics modeling or something that is clear enough in scope to do programmatically.
I do much simpler work, but I find it's much easier to automate processes one step at a time rather than all at once.
It's actually the obvious thing to do.
Reason? Lack of theory. We don't understand enough about what causes something to work in one case, but not another.
There's just no getting away from theory, in the end.
Often it's more productive to simply throw in a bunch of bias early and use a closed form method that's familiar, sufficient, quantified, straightforward, and interpretable. This becomes all the more likely when the quantity and quality of training data is limited or variable. Scientists we serve aren't especially keen for us to play around with cool new toys when they're on a tight schedule.
That said, there do exist a number of problems that are so noisy or contain so many signals that finding some sort of low bias method that's also domain agnostic could be a big win. Fortunately these kinds of problems are famously hard (like protein conformation prediction, or 3D shape modeling) so if we can make even a little improvement over the status quo using DL/ML, our scientists are usually willing to put up with the higher costs (and often lower punctuality) in our development of novel learned methods.
Often scientists want to play around with new tech too. Working with them to explore novel approaches (like deep learning) to serve their projects' "stretch objectives" is sometimes an opportunity for us both to get ambitious and try out something new.
Every scientist I know has a desk covered in papers (presumably, the select few that were important enough to print out) and even for experts it takes time to glean information from densely-written things like that; there's definitely a problem there to be solved.
AI researchers too would like to have such a tool. It's getting hard to keep up with the torrent of papers. But such a tool would require actual understanding, thus, we get to have a subproblem which is more difficult than the original problem of finding information in text.
You can build an entire CDS from the data, which is publicly available.
And if you can't find a problem to solve, then why don't you ask where the pain is? There are a _ton_ of low-hanging-fruit items that NLP and deep learning can help with _today_ in the clinical environment. How about:
Question-answering for doctors who immediately need to find time-critical facts from a patient's medical record? My sister, a hospitalist, begs for help in this area when treating seriously ill patients who come into the ER from different hospitals.
How about helping insurers better model their patient populations from unstructured data?
ICD coding for medical billing, our extremely wasteful 16-20 billion dollar a year industry with _terrible_ NLP solutions in the current market?
There are problems to be solved. You just need to ask your customers, and find their pain.
I'm not in the NLP field, but this does seem like a legitimate use
Find what the users do with the data and make it easier for them using modern tools.
2) Medicine is a legal minefield. There is definitely a place for AI to cure illness and improve health. It's hard to get the data and to standardize its shape, but I'm convinced that, by the mere fact that it will have more personal data and vastly more general knowledge, an AI generalist doctor will trounce any generalist doctor, even with simple predictors. The real value is not in curing you when you are ill, but in preventively keeping you healthy through continuous monitoring and adapted personal recommendations. Look at performance-enhancing equipment for optimal athlete training, for example.
There are plenty of things you can do with your graph database (probably more with the metadata than with the data) that can provide value. You can use it to recommend papers to read (like arxiv sanity). Use it to connect researchers: for example, if one of your co-authors works with another researcher, you can suggest a working relationship. You can cluster researchers. You can rank researchers according to a multitude of criteria (like an Elo score for researchers). You can create recommendations that reduce isolation between clusters and promote information dispersion.
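A minimal sketch of the "connect researchers through shared co-authors" idea, assuming the only input is a list of author lists (one per paper). The names `build_coauthor_graph`, `suggest_collaborators`, and the toy data are hypothetical, and ranking by shared co-author count is just one simple choice among many:

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_coauthor_graph(papers):
    """Map each author to the set of people they have published with."""
    graph = defaultdict(set)
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def suggest_collaborators(graph, author, top_n=3):
    """Rank non-coauthors by how many coauthors they share with `author`."""
    counts = Counter()
    for coauthor in graph.get(author, ()):
        for candidate in graph.get(coauthor, ()):
            if candidate != author and candidate not in graph[author]:
                counts[candidate] += 1
    # Sort by shared-coauthor count (descending), then by name for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

# Hypothetical toy data: each inner list is the author list of one paper.
papers = [
    ["alice", "bob"],
    ["bob", "carol"],
    ["alice", "dave"],
    ["dave", "carol"],
    ["carol", "erin"],
]
graph = build_coauthor_graph(papers)
suggestions = suggest_collaborators(graph, "alice")
print(suggestions)
```

Here "carol" is suggested to "alice" because two of alice's co-authors have each published with her. On a real publication graph you would probably want author disambiguation and some weighting by recency or venue before trusting such a ranking.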
Google Scholar is pretty much the only place you can go for that sort of thing, and its distance from things like Microsoft Academic Search is leagues.
I'd kill for a non-Google academic search, and smarter Google Scholar/PubMed alerts would be appreciated.
Based on the link you provided, it appears to be a cloud tool that allows someone to generate end reports comparable with uptodate.com. Is this the case?
A domain specific vertically integrated search engine seems like a great idea but probably requires being bought by Bing or Google to incubate while both the world and the technologies develop.
Are the inputs to your knowledge graph weighted by reliability? If not, the knowledge graph is unuseful.
Anyway, the primary reason that AI for drug discovery is overhyped is that the sort of problems AI is good at solving don't line up well with the unsolved problems in the drug discovery pipeline.
This article, for example, focuses a lot on lead generation. Lead generation is the easiest aspect of the problem to tackle using AI, and so most people doing research start out trying to build a foundation in this space. However, it doesn't actually represent the majority of the cost.
Drug makers typically spend about $800M on failed drugs for every ~$900M in revenue. They aren't spending that $800M on leads; finding leads is fairly easy. They are spending that money on drugs that fail in Phase 2 and Phase 3, which is more about off-target side effects, bulk formulation and synthesis, patient population differences, drug-drug interactions, etc.
It would be nice to have better leads, but there isn't a shortage of them that look good in vitro or even in vivo. It isn't until much later in the pipeline that the costs really add up, and failures there are expensive. If we could solve off-target side effects using AI, then we'd be in a whole different ballgame. Having banged my head against it for a while, I think it is possible, but will take a huge amount of investment.
The work this article talks about is more foundational, which is necessary but should not really be taken as anything more.
for those not familiar with where the money really goes in R&D, and where the biggest opportunities for improvement are, check out these charts  and  from the seminal paper on calculating the cost of getting a drug approved .
the biggest areas to improve R&D productivity lie in 1) picking more validated targets to reduce the Phase 2 and 3 failure rate and 2) reducing the cost of lead optimization (basically the process of turning a compound that has the desired impact on the target (a target is a molecule implicated in disease that you want to affect with a drug) into a molecule with drug-like properties (ie gets where you need it in the body, is safe, can be manufactured and delivered efficiently))
i'm not an AI person, but could AI play a role here? there are plenty of validated targets that are "undruggable"; could AI help find as-yet-undiscovered molecules that could engage these targets? or could AI somehow make the med chem / lead optimization process easier?
Until Wall St (or Sand Hill Rd) understands that domain-agnostic low-info approaches like AI are incapable of answering complex questions that require teams of PhDs steeped in decades of doing both chemistry and biology, the notion of CADD will continue to miss the mark and waste megabucks.
I had to do Tox21. I was using a Bayesian network for drug toxicity at the FDA; we used the DILI database.
Can you talk more about what you've been thinking about here?
1) SAAS in the pharma world is mostly a waste of time. Culturally, they don't want to pay for anything except drugs. There is also a culture of sunk cost, where they do not want to prune drugs from their pipeline based on what some piece of software says.
2) This is a boil-the-ocean approach, which does not work statistically. There are 20,000 targets. Predicting bioavailability at each target is very difficult, and different populations have different expression patterns. Even if you have 99% precision/recall for each one, odds that you can help with selection enrichment are infinitesimal. Even if you restrict it to a handful of targets with strong known side effects, the state of the art predictions are still not good enough to meaningfully improve the outcomes.
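To put rough numbers on that, here is a back-of-the-envelope sketch. It makes two simplifying assumptions that are not in the comment: that "99% precision/recall" can be treated as a 99% chance of each per-target call being right, and that errors are independent across targets.

```python
import math

N_TARGETS = 20_000      # figure quoted in the comment
PER_TARGET_ACC = 0.99   # optimistic per-target accuracy (simplifying assumption)

# Assumption: errors are independent across targets.
expected_errors = N_TARGETS * (1 - PER_TARGET_ACC)
log10_p_all_correct = N_TARGETS * math.log10(PER_TARGET_ACC)

print(f"expected wrong calls per compound: {expected_errors:.0f}")
print(f"P(all {N_TARGETS} calls correct) ~ 10^{log10_p_all_correct:.1f}")
```

Under those assumptions, every compound screened against the full target set carries on the order of 200 wrong calls, and the chance of a fully correct profile is vanishingly small. Correlated errors would change the exact figures but not the general shape of the problem.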
There are better approaches, nothing easy though.
It is hard to look at a single company to see how the money is spent on failed drugs.
There are many, many, many other open problems in biology, of course, ranging all along the conceptual gamut, so I can't say there's nothing where AI would be a killer app. And there's also the sort of last-ditch null argument that, well, being smart helps humans do biology, so computers that are smart must be useful somehow. But typical progress in this field is from patiently acquiring new data. There's no point where you know "enough" of the picture to infer the rest with any degree of certainty: the degrees of freedom in the "design" of biological systems are simply too many.
Obviously actual drug design is more complicated than this, and you have to consider side effects. But I don’t think it’s as hopeless as your analogy.
If we have that, then the problem is entirely trivial: just look at the lengths of the tumblers, and give the key teeth that match. And so it is, with the corresponding increase in complexity, in biology: given a complete and confident structure of the target, we have (non-AI, reasonably reliable) chemical modelling methods to see whether and how a given compound will bind; iterate molecules until you find one that works and you've found your "key" quite cheaply. So, there's another limitation to the practicality of AI: when it's not irreducibly complex and intrinsically unknown, biology is usually mechanistic and deterministic enough that conventional approaches can reason about it quite effectively without resorting to anything so fancy as a neural network.
Biology is almost always more complicated than a given description, so yes, the real problem is more complicated. As you mention, the key we create must not only fit the target lock, but it must also not fit any other locks that might cause problems. Perhaps AI methods could try to make a working key that is as un-key-like as possible, to make it unlikely to fit off-target locks; potentially useful. But, absent knowledge of the various perverse configurations that locks in the wider world can have, the confidence in the AI method's inferences is very limited; roughly speaking, it's attempting to make a classifier for a population that is mostly unknown. We might hope for a severalfold increase in the number of successful drug candidates, in the best case, but not a fundamental disruption of the pharmaceutical research process.
I am a researcher in the computational chemistry/biophysics field, and from what I can tell this is not the case (would love to be corrected if I am mistaken; my research is not directly related to drug design). There are various computational ways to test whether a molecule will bind to a target site, but they are either too inaccurate or too computationally expensive to claim that this problem is "solved" by chemical modeling methods. Pharma companies are still wasting money sending drug structures that have been validated by computational methods to organic chemists to synthesize, only for them to end up not binding as expected.
First-principles understanding of computing hardware allows chip manufacturers to create novel chips from scratch, as we fully understand the physics behind electricity, how logic gates work and how signals propagate through physical mediums. Lock most college students in a room with resistors, diodes, wire, etc, and reference material, and they could recreate basic circuitry and very simple computers. You cannot do the same with biology: no student or expert, given unlimited resources, equipment, or reference material, can construct a new living cell from scratch (proteins / molecules). This is not a slander of those sciences and scientists, but there is simply a huge gap between how well we understand the basic principles of life and how well we understand the basic principles of computing.
Throwing more resources / computing into "drug discovery" is like trying to build chips by wiring computer parts differently. Occasionally it might work and produce a "useful" result, but it's fundamentally a broken approach.
It's fine to make approximations to avoid exponential scaling, but applying function approximators essentially randomly won't get you anywhere. This is then compounded by the fact that the functional framework you're starting from is not a first-principles approach.
Until there is QMC for drug discovery, it will all be hype.
But AI definitely has a near-term future in addressing well-formed questions like specific assays or searching for well-constrained targets, like ligand matches. The trick is for the AI contributor TO LEARN SOMETHING ABOUT THE DAMNED DOMAIN. Unless the chemist/biologist is intimately involved in the task, the AI provider is shooting blind. But with many wise eyes on the ball, even the hardest problems become a lot more assailable.
[I say this as someone who processes images and analyzes data within a big pharma, and has seen several grand IT plans fail (like systems biology disease modeling) and many small & specific scientist-assistance tasks succeed.]
Craig Venter did exactly this.
"Even Venter acknowledges that syn3.0’s genome, although new, was designed by trial and error, rather than being based on a fundamental understanding of how to build a functioning genome."
I did a project a year or so ago to reimplement one of Aspuru-Guzik's papers, a variational autoencoder, (https://github.com/maxhodak/keras-molecules) and when I did that I compared to char-RNN and the VAE did get much more interesting results. I also saw results from other people around that time showing that using graph convolutions on the front end instead of one-hot encoded SMILES strings was even better.
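For readers unfamiliar with the term, a minimal sketch of what "one-hot encoded SMILES strings" means. This is an illustration only, not code from the linked repository; the function name, the tiny alphabet, and the space-padding convention are assumptions. It also tokenizes character by character, which would split multi-character atoms such as "Cl".

```python
def one_hot_smiles(smiles, alphabet, max_len):
    """Encode a SMILES string as a max_len x len(alphabet) matrix of 0/1 rows."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    padded = smiles.ljust(max_len)  # pad with spaces; " " must be in alphabet
    matrix = []
    for ch in padded[:max_len]:
        row = [0] * len(alphabet)
        row[index[ch]] = 1
        matrix.append(row)
    return matrix

# Hypothetical, deliberately tiny alphabet; index 0 is the padding character.
ALPHABET = [" ", "C", "O", "N", "(", ")", "=", "1"]
encoded = one_hot_smiles("CCO", ALPHABET, max_len=5)  # ethanol
for row in encoded:
    print(row)
```

Each row marks which character sits at that position, so the model sees the molecule purely as a sequence of symbols. Bonding structure is only implicit in the string's grammar, which is the contrast with graph-based front ends.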
Also, as for OP's objection about GANs not working because it's a "perfect discriminator," this is an obvious result that becomes apparent after spending 20 seconds with the problem. (Eg, this thread here: https://github.com/maxhodak/keras-molecules/issues/55) I haven't read the Harvard paper referenced but I'd be absolutely shocked if this was lost on them. There are definitely ways to work through it.
I also agree that there are many ways to work through the perfect discriminator problem for ORGAN. But it remains to be done (afaik).
But they were pretty unfair to the Stanford lab (Vijay Pande). The Stanford team is using graph convolutions to process molecular structures. The article complains that the Stanford team is biased because they are not using sequences to represent the molecular structures. But this actually makes a lot of sense because molecular structures are graphs. The article then goes on to say that the Stanford team has an agenda to get everyone to use deepchem instead of TensorFlow or PyTorch. But TensorFlow is a dependency in deepchem, and deepchem looks like a bunch of useful tools wrapped around TensorFlow.
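A toy illustration of the "molecules are graphs" point, using ethanol built by hand. This is not DeepChem's API or the Stanford team's method; the two-element feature vector and the unweighted `aggregate` step are assumptions chosen only to show how neighbour information flows along bonds.

```python
# Ethanol (SMILES "CCO"), heavy atoms only; hydrogens are left implicit.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]  # undirected single bonds between atom indices

# Adjacency list: atom index -> indices of bonded neighbours.
neighbors = {i: [] for i in range(len(atoms))}
for a, b in bonds:
    neighbors[a].append(b)
    neighbors[b].append(a)

# Toy per-atom feature vector: [is_carbon, is_oxygen].
features = [[1, 0] if sym == "C" else [0, 1] for sym in atoms]

def aggregate(features, neighbors):
    """One unweighted message-passing step: own features plus sum of neighbours'."""
    out = []
    for i, own in enumerate(features):
        total = list(own)
        for j in neighbors[i]:
            total = [t + f for t, f in zip(total, features[j])]
        out.append(total)
    return out

updated = aggregate(features, neighbors)
print(updated)
```

After one step the middle carbon "knows" it is bonded to a carbon and an oxygen. A real graph convolution would add learned weights and nonlinearities on top of this kind of neighbourhood aggregation.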
I'm not buying the article's arguments.
On the whole, a lot of the arguments apply to ML in general. There is far too much focus on individual problems and the accuracy measure without enough focus on the interpretation of the accuracy, how well it will really generalize, etc.
Look at the big list of exotic models they tried: http://moleculenet.ai/models
For me, this omission is a negligence. It is selective laziness.
The author (me) does not have 20 PhD students, postdocs and startuppers under his hand, to do the job for Stanford. With my limited resources, it is more cost-effective to do my job against Stanford ;)
The MoleculeNet co-author gave another argument in the comments. My refutation is that you can't claim to lack time after 8*20 man-months have passed.
1. I agree that graph convolutions make a lot of sense. However, it is not shown yet that they are better than SMILES, although it might just be a matter of time. A lot of graph variants are possible, or maybe we should look at molecules not as graphs, but as quantum objects (See the other Stanford paper on atom convolutions).
2. Yes, Deepchem is built on top of TensorFlow. This additional layer looks useful, but is it really? Using standard NLP models is simpler than using deepchem models, and NLP models might still be the state-of-the-art for chemistry tasks, via SMILES (until we get a reasonably comprehensive benchmark, which MoleculeNet is not).
I am not buying into a chemistry-specific library, until it is shown to be really necessary.
We still don't even understand how exactly brains work.
AFAIK we don't know how much of that generalizes.
Given enough time, energy, and computing power, you could just precompute every possible outcome and call it a neural network. You'll be praised as a genius.
Neural nets have been very successful for a lot of things while also being overhyped, but dismissing it all as marketing hype is throwing the baby out with the bathwater.
...With better performance than humans, that is pretty magical.
We are NOT trying to get vendor lock-in for users of DeepChem. If there are complaints about lock-in using our tools, please file a GitHub issue and we can work on trying to create an open, easy-to-use API for everyone.
Does it fit the Deepchem agenda? That's another question ;)
1) Tutorial and documentation on how to access the raw TensorFlow graph when using DeepChem
2) Tutorial on combining raw TensorFlow with our existing chemistry-specific layers
3) Different documentation quick-start sections for ML practitioners and application practitioners.
4) Better overall documentation of our Chemistry Specific layers.
Would these ideas have made a better first user experience for you?
Such techniques have long been used in medicine and biology with great success, very notably in the example of the Robot Scientist, a system that automates the scientific process end-to-end, in a biology context.
This is from the abstract of the Nature paper, from January 2004:
The question of whether it is possible to automate the scientific process is of both great theoretical
interest and increasing practical importance because, in many scientific areas, data are being generated much
faster than they can be effectively analysed. We describe a physically implemented robotic system that applies
techniques from artificial intelligence to carry out cycles of scientific experimentation. The system
automatically originates hypotheses to explain observations, devises experiments to test these hypotheses,
physically runs the experiments using a laboratory robot, interprets the results to falsify hypotheses
inconsistent with the data, and then repeats the cycle. Here we apply the system to the determination of gene
function using deletion mutants of yeast (Saccharomyces cerevisiae) and auxotrophic growth experiments. We built
and tested a detailed logical model (involving genes, proteins and metabolites) of the aromatic amino acid
synthesis pathway. In biological experiments that automatically reconstruct parts of this model, we show that an
intelligent experiment selection strategy is competitive with human performance and significantly outperforms,
with a cost decrease of 3-fold and 100-fold (respectively), both cheapest and random-experiment selection.
 Functional genomic hypothesis generation and experimentation by a robot scientist, https://www.nature.com/articles/nature02236
Learning the grammar of the SMILES representation is not the same as learning to use interatomic distances.
Disclaimer: I have zero experience in drug discovery.