1) As someone who does NLP and AI for Evidence Based medicine [1] I can only agree. I build natural language processing tools, together with the team, for EBM. I have /no/ idea what to do next. Please. Help me.
2) Rant: here's the thing, we have SVMs, LSTM, BiLSTM-CRF, Knowledge Graphs, 28M articles, deploy infrastructure, tests, you know all the cool stuff. We can't sell it. It's PubMed this, PubMed that. There's no breaking that spell. So instead of building a better search engine, we focused on better analytics. Is it cool, sure. Is it as good as all the KoL stuff from LexisNexis, nope. Medicine is /hard/, if you think your app is hard: medicine is /harder/. Am I trying to cover my ass, sure. But seriously, deal with everything being non-standard and proprietary for long enough and you'll feel the same way. 80% of my job is saying "no" to people who aren't going to pay in the first place (Sanofi, Novartis, etc). They don't care. I'm just going to ride this out, throw some blockchain in somewhere where it makes absolutely 0% sense, call it a day. Seriously, NLP/AI for medicine: it's crap 100%.
And, I'm a "senior developer" with publications behind my ass on this. I don't know what to do instead of riding the hype train. Seriously: anyone out there that knows what to do with 100M+ medical publications in a 100M+ knowledge graph, be my guest: I'd hire you in a heart beat. But it's hard and there is nothing to be gained AFAIK.
Imagine you were given raw data from the large hadron collider and you were asked to find interesting things in the data. What would you do ?
Any serious scientist would see the folly of trying to run only machine learning algorithms to find the Higgs boson. What you would need (in addition to ML algorithms for processing) is a good theory of what you're looking at. This also means just applying computer science methods to the problem isn't going to work, you need to inject theory from physics.
For drug discovery this means you need to actually do biology at some point in order to make progress. ML strategies involve a tight loop of trying different models, implementing them, and checking whether each does better than the others. What's needed, then, is good experiments to test your models and good models to make novel predictions.
There is no getting away from wet lab experiments in biology if you're going to make any significant advances. The theory just isn't there to do purely computational work.
Exactly, and that uncovers the folly in this: what are you trying to do. Often enough it's very specific, and requires actually going out into the world. Sometimes the literature is already there. But nearly always it requires some form of analytical thought. "I want X because I need to prove Y". There is no AI magic sauce around that. It means that the people interacting with the "user friendly" interface need to know, and need to know how to get there somehow. That's rarely the case. And from a product perspective it makes things very hard: you get bombarded with very specific requests that require time and effort to build UIs and models for, that at the end of the day only solve that specific question. Instead, it would be better to equip people with the tools needed to solve the meta-question so to speak, but that meta question is much much harder (at least, I've been trying but no success yet). As a funny anecdote: my medical knowledge has exploded. But I see that as a symptom of the broader problem: I know how to do /some/ things because I've gained that domain knowledge, but I can't teach the system yet to solve the "general" problem because a) I don't know what it is b) there is no data to get there c) what even is the UI for that.
You're being asked to use AI to replace an entire process all at once, rather than to assist with or replace a step in that process.
Instead of being asked "how do I solve this problem" it would be nice if you were asked "can you replace this thing I have to do all the time?"
Maybe this doesn't apply to your work, but if I were in your position I'd talk to the human that does the human version of the process and break down everything they do to get from A to Z into steps. If you can start by just replacing or augmenting a single step, you're adding value right away and you've also built a part of your eventual A to Z machine, right? For example, your search engine idea is a good direction. Can you automate the process of finding relevant articles? That would help the humans but would also help whatever AI you're working up to, right? Then maybe further in there's some other process involving physics modeling or something that is clear enough in scope to do programmatically.
I do much simpler work, but I find it's much easier to automate processes one step at a time rather than all at once.
Well, that's because physics has a model. Biology has nearly no model at all, so people first look for features in the data, and then verify those empirically.
As someone involved in the field, I agree on the need for a better theory, but disagree with your claim that "there's no getting away from the wet lab". The traditional wet lab absolutely has to be gotten away from, because there are huge issues with transferring knowledge gained on organism X onto organism Y.
Reason? Lack of theory. We don't understand enough about what causes something to work in one case, but not another.
There's just no getting away from theory, in the end.
Perfect answer. This is exactly the case, from Swanson's seminal text mining in biology paper on fish oil in the 80's with Arrow, to modern tools like Leach's Hanalyzer: You need to focus on a very narrow target application or have a clear hypothesis you wish to prove. And, wetlab work, human intervention, and tons of labor are major ingredients of those (rare) examples that go beyond information retrieval or very targeted information extraction. (Also note how none of these guys was able to scale that work and produce a whole chain of findings... ;-))
I do AI-related analysis inside the pharma world, on the image/data side. While I employ numerous traditional analytical methods to extract info from images, few are as sexy as deep learning (though that approach is rising in day-to-day utility in bio-imaging) or most ML/PR approaches. How much ROI benefit is there to devising a higher level of performance or adaptation? And do we have enough data to train from?
Often it's more productive to simply throw in a bunch of bias early and use a closed form method that's familiar, sufficient, quantified, straightforward, and interpretable. This becomes all the more likely when the quantity and quality of training data is limited or variable. Scientists we serve aren't especially keen for us to play around with cool new toys when they're on a tight schedule.
That said, there do exist a number of problems that are so noisy or contain so many signals that finding some sort of low bias method that's also domain agnostic could be a big win. Fortunately these kinds of problems are famously hard (like protein conformation prediction, or 3D shape modeling) so if we can make even a little improvement over the status quo using DL/ML, our scientists are usually willing to put up with the higher costs (and often lower punctuality) in our development of novel learned methods.
Often scientists want to play around with new tech too. Working with them to explore novel approaches (like deep learning) to serve their projects' "stretch objectives" is sometimes an opportunity for us both to get ambitious and try out something new.
I'm surprised by this, since I was coming in here to say "AI in bio research is mostly a dead-end, except perhaps for NLP paper-mining stuff." I can see how medical applications would be tricky (since the field is covered in regulation from the 19th century and doctors have a deep-seated fear of computers) but as a (roughly speaking) biologist I would love to have a program that would let me ask high-level conceptual questions and get answers from the general mass of academic literature. Or, given some experimental result, find equivalent past experiments that corroborate/negate the result. Etc. Things that seem plausible with modern NLP and a fairly conservatively-written corpus, but I haven't heard of tools that could do this. (Even Siri style "here's what I found for your question, now look through these and see if they make sense" would be abundantly useful, if it's sifting out the right ~10 papers most of the time.)
Every scientist I know has a desk covered in papers (presumably, the select few that were important enough to print out) and even for experts it takes time to glean information from densely-written things like that; there's definitely a problem there to be solved.
> I would love to have a program that would let me ask high level conceptual questions and get answers from the general mass of academic literature
AI researchers too would like to have such a tool. It's getting hard to keep up with the torrent of papers. But such a tool would require actual understanding, thus, we get to have a subproblem which is more difficult than the original problem of finding information in text.
Designing a program that can read (all) papers and understand their content conceptually, then comprehend the question it is asked, and finally reason about all that to generate a sensible output is equivalent to creating an AI (a real one, not the marketing bullshit type we've been fed for the last few years). I don't understand why people even imagine this is possible with current technology?
I don't agree with this. Have you not seen the work by Regina Barzilay and David Sontag at MIT? His paper about embeddings of categorical data across 4.5million medical records?
You can build an entire CDS from the data, which is publicly available.
And if you can't find a problem to solve, then why don't you ask where the pain is? There are a _ton_ of low hanging fruit items that NLP and Deep Learning can help with _today_ in the clinical environment. How about:
Question-answering for doctors that immediately need to find time-critical facts from a patient's medical record? My sister, a hospitalist, begs for help in this area when treating seriously ill patients that come into the ER from different hospitals.
How about helping insurers better model their patient populations from unstructured data?
ICD coding for medical billing, our extremely wasteful 16-20billion dollar a year industry with _terrible_ NLP solutions in the current market?
There are problems to be solved. You just need to ask your customers, and find their pain.
You're not addressing the problem area the author is looking at though, which is lead generation (finding new molecules that either have high binding affinity to a receptor and/or might be useful for a given medical application).
You call your problems "low hanging fruit", yet they are real problems that multi-billion dollar enterprises are facing and know of. If they really were as "low hanging" as you say, don't you think they'd have the resources and capacities to already have solved them? (Looking at you, Watson...)
They are, but you have to realize that the companies themselves are focused on healthcare first, and reimbursement. Healthcare in the United States is a fee for service model, not a value based one. So the system isn't aligned for it. But they are doing it, look at the sepsis NLP solution created by Mass General in-house.
A bit removed from drug targets, but have you seen AMELIE - it's for ranking causal variants (genetic mutations) for patients based on literature + patient's phenotype?
1) What short-term goals are you trying to achieve ?
Recent AI advances have made it easier to transform unstructured information into structured one. But you'll still need to do some boring correspondence dataset construction. There is great value for a better research/cataloging of publication data (but probably not many people will pay for it).
Find what the users do with the data and make it easier for them using modern tools.
2) Medicine is a legal minefield. There is definitely a place for AI to cure illness and improve health. It's hard to get the data, and to standardize its shape, but I'm convinced that, simply because it will have more personal data and vastly more general knowledge, an AI generalist doctor will trounce any human generalist doctor even with simple predictors. The real value is not to cure you when you are ill, but to keep you healthy preventively through continuous monitoring and adapted personal recommendations. Look at performance-enhancing equipment for athletes' optimal training, for example.
There are plenty of things you can do with your graph database (probably more with the metadata than with the data) that can provide value. You can use it to recommend papers to read (like arxiv sanity). Use it to connect researchers: for example, if one of your co-authors works with another researcher, you can suggest a working relationship. You can cluster researchers. You can rank researchers according to a multitude of criteria (like an Elo score for researchers). You can create recommendations that reduce isolation between clusters and promote information dispersion.
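To make the "connect researchers" idea concrete, here is a minimal sketch, assuming the graph metadata can be flattened to a mapping from paper to author list. The paper ids, author names, and both helper functions are hypothetical; a real knowledge graph would be queried rather than held in a dict.

```python
from collections import Counter

# Hypothetical co-authorship data: paper id -> list of author names.
papers = {
    "p1": ["alice", "bob"],
    "p2": ["bob", "carol"],
    "p3": ["carol", "dave"],
    "p4": ["bob", "erin"],
}

def build_coauthor_graph(papers):
    """Return a mapping: author -> set of direct co-authors."""
    graph = {}
    for authors in papers.values():
        for a in authors:
            graph.setdefault(a, set()).update(x for x in authors if x != a)
    return graph

def suggest_collaborators(graph, author):
    """Rank authors two hops away by how many co-authors they share."""
    direct = graph.get(author, set())
    counts = Counter()
    for co in direct:
        for candidate in graph.get(co, set()):
            # Skip the author and anyone they already publish with.
            if candidate != author and candidate not in direct:
                counts[candidate] += 1
    # Most shared co-authors first; ties broken by name for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

graph = build_coauthor_graph(papers)
suggestions = suggest_collaborators(graph, "alice")
print(suggestions)
```

Counting shared co-authors is only the simplest possible ranking signal; weighting by recency or topic overlap would be the obvious next step, but that goes beyond what the comment describes.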
If you're a basic research scientist, you can't just rely on PubMed, simply not everything is indexed there.
Google Scholar is pretty much the only place you can go for that sort of thing, and it is leagues ahead of things like Microsoft Academic Search.
I'd kill for a non-Google academic search, and smarter Google Scholar/PubMed alerts would be appreciated.
I think there is a lot to be gained by creating better systems for searching and synthesizing diverse pieces of biomedical knowledge. But indeed this is mostly a field of research, rather than something that could be directly marketed. Did you think about going (back) into academic research?
Absolutely, always wanted to finish my Epidemiology PhD. But eeh the pay is awful. Not that I /really/ care, but that's getting into personal details. Still, whatever I could come up with now would not be much different than what I would come up with during academic research. The machine in my head is working regardless.
Based on the link you provided, it appears to be a cloud tool that allows someone to generate end reports comparable with uptodate.com. Is this the case?
A domain specific vertically integrated search engine seems like a great idea but probably requires being bought by Bing or Google to incubate while both the world and the technologies develop.
You probably already know, but there are teams with the national library of medicine that work specifically on natural language processing within the scope of pubmed, but also on new ways to get to the publications. I almost interned there for a summer, very fascinating stuff.
yep, we're trying to outcompete them. But it's a hard thing. MetaMap: we beat it, cTakes, we can do it faster. But really how many < 80ms concept recognition systems do you need that work on 28M documents. And if you have that: what are you going to do with it. We have, internally, systems that predict dense vectors based on CUI graphs of concepts in text. Great for similarity search: but utterly useless in the end.
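For readers who haven't seen it, "similarity search over dense vectors" boils down to something like the sketch below. The document names and three-dimensional vectors are made up for illustration, and the CUI-graph model that would produce real vectors is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)

# Hypothetical document embeddings (real ones would have hundreds of dimensions).
doc_vectors = {
    "doc_statins": [0.9, 0.1, 0.0],
    "doc_fish_oil": [0.7, 0.3, 0.1],
    "doc_sepsis": [0.0, 0.2, 0.9],
}

def most_similar(query, vectors, k=2):
    """Return the k (name, score) pairs closest to the query vector."""
    scored = [(name, cosine(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

top = most_similar([1.0, 0.0, 0.0], doc_vectors)
print(top)
```

This brute-force scan is fine for a toy; at 28M documents you would need an approximate nearest-neighbour index, which is a separate engineering problem from whether the results are useful.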
Yeah, I run into this problem all the time in the space. You build something really awesome, and either doctors don't care, it doesn't fit the workflow, or it turns out to be minimally useful in practice. We abandoned most NLP efforts and dump tagged terms into algolia for a lot of use cases and it works well enough to fit a prescribing workflow. The docs already know what they're searching for, they just need to know it's in the catalog and like the autocomplete.
If I might suggest something, maybe forget about targeting older doctors and researchers set in their ways. Go after the youth, the ones just starting their education in research/medicine, you'll find us far more pliable.
I don't think you need to out-compete these systems in processing speed. Even if you can only do 10 documents/hour, that probably can be scaled to handle a few million. Where you need to out-compete cTakes & co (by a large margin) is information extraction quality (to have a viable commercial product). And that is very difficult...
As someone who has worked on AI for drug discovery, I would say that the title is correct, but not for the reasons stated. There is also some annoying speculation in this document that is completely incorrect, and unsupported, so I caution people when reading it.
Anyway, the primary reason that AI for drug discovery is overhyped is that the sort of problems AI is good at solving don't line up well with the unsolved problems in the drug discovery pipeline.
This article, for example, focuses a lot on lead generation. Lead generation is the easiest aspect of the problem to tackle using AI, and so most people doing research start out trying to build a foundation in this space. However, it doesn't actually represent the majority of the cost.
Drug makers typically spend about ~$800M on failed drugs for every ~$900M in revenue. They aren't spending that $800M on leads, finding leads is fairly easy. They are spending that money on drugs that fail in Phase 2 and Phase 3, which is more about off-target side effects, bulk formulation and synthesis, patient population differences, drug-drug interactions, etc.
It would be nice having better leads, but there aren't a shortage of them that look good in vitro or even in vivo. It isn't until much later in the pipeline that the costs really add up, and failures there are expensive. If we could solve off-target side effects using AI, then we'd be in a whole different ballgame. Having banged my head against it for a while, I think it is possible, but will take a huge amount of investment.
The work this article talks about is more foundational, which is necessary but should not really be taken as anything more.
for those not familiar with where the money really goes in R&D, and where the biggest opportunities for improvement are, check out these charts [1] and [2] from the seminal paper on calculating the cost of getting a drug approved [3].
the biggest areas to improve R&D productivity lie in 1) picking more validated targets to reduce Phase 2 and 3 failure rate and 2) reducing the cost of lead optimization (basically the process of turning a compound that has the desired impact on the target (target is a molecule implicated in disease that you want to affect with a drug) into a molecule with drug like properties (ie gets where you need it in the body, is safe, can be manufactured and delivered efficiently))
im not an AI person, but could AI play a role here? there are plenty of validated targets that are "undruggable"; could AI help find as-yet-undiscovered molecules that could engage these targets? or could AI somehow make the med chem / lead optimization process easier?
The primary missions in drug development are efficacy and safety. AI can help answer clear well-formed efficacy questions like, "Does this molecule fit a chosen target molecule"? But it can't help with bigger efficacies like, "Will hitting this target make a sufficient difference in managing this disease"? Or any safety questions like, "Does the molecule also hit any other molecules somewhere in the body (or population) that might screw up something else"? And unfortunately it's safety failures that eat up 90% of drug development costs (esp. in Phase III).
Until Wall St (or Sand Hill Rd) understands that domain-agnostic low-info approaches like AI are incapable of answering complex questions that require teams of PhDs steeped in decades of doing both chemistry and biology, the notion of CADD will continue to miss the mark and waste megabucks.
Well, as far as I know the state-of-the-art results on current drug toxicity prediction benchmarks (e.g., Tox21) are held by deep neural networks; it seems like recent AI advances HAVE proven useful in that regard.
> If we could solve off-target side effects using AI, then we'd be in a whole different ballgame. Having banged my head against it for a while, I think it is possible, but will take a huge amount of investment.
Can you talk more about what you've been thinking about here?
Not sure what OP is thinking, but you can look at this example of a commercial product designed for the prediction of off-target effects (https://cyclicarx.com/ligandexpress/).
1) SAAS in the pharma world is mostly a waste of time. Culturally, they don't want to pay for anything except drugs. There is also a culture of sunk cost, where they do not want to prune drugs from their pipeline based on what some piece of software says.
2) This is a boil the ocean approach, which does not work statistically. There are 20,000 targets. Predicting bioavailability at each target is very difficult, and different populations have different expression patterns. Even if you have 99% precision/recall for each one, odds that you can help with selection enrichment are infinitesimal. Even if you restrict it to a handful of targets with strong known side effects, the state of the art predictions are still not good enough to meaningfully improve the outcomes.
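A back-of-the-envelope version of that statistical point, under two loudly simplifying assumptions that are not in the comment itself: "99% precision/recall" is read as a 99% chance of making the right call on each target, and the calls are treated as independent.

```python
import math

n_targets = 20_000           # figure quoted in the comment
per_target_accuracy = 0.99   # assumed reading of "99% precision/recall"

# Probability that every single per-target call is right (independence assumed).
p_all_correct = per_target_accuracy ** n_targets

# Expected number of wrong calls for one compound screened against all targets.
expected_errors = n_targets * (1 - per_target_accuracy)

# The same probability on a log10 scale, which is easier to read.
log10_p = n_targets * math.log10(per_target_accuracy)

print(f"expected wrong calls per compound: {expected_errors:.0f}")
print(f"log10 P(all calls correct): {log10_p:.1f}")
```

Under those assumptions each compound carries about 200 expected wrong calls across the target panel, so a true off-target hit is easily buried in the noise. Correlated errors or a smaller panel would change the numbers but not the shape of the argument.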
I'm not an expert in this space, but lead generation doesn't seem like a fundamentally bad place to do this - if you could accurately rank leads this would deal with the failed drugs - it just hasn't achieved the results people would like.
"~$800M on failed drugs for every ~$900M in revenue" Do you have a source for this? Looking at big Pharma the marketing budget is at least 2X the R&D budget so the numbers really look off from what you are posting?
Most of the actual R&D is done by smaller companies, and then the large companies (e.g. Pfizer) buy up the compounds in Phase ~2. They do have some of their own development, but it isn't the majority for most players. They'll primarily take drugs through Phase 3, then deal with synthesis, distribution, and marketing.
It is hard to look at a single company to see how the money is spent on failed drugs.
The fundamental problem with AI (here meaning, neural-network based applications) applied to biology is that pattern recognition doesn't get you very far. When designing a small molecule drug, for example, the goal is to find a molecule that will slot into some protein (or other biomolecule) of interest in just the right way; whether it does so is dependent on shape in complex and impossible-to-predict ways. As an analogy, take the task of making a key (drug) to open the front door of an apartment across town (disease-relevant target); your "training data" (known drug-y molecules) are the keys for a bunch of other apartments sampled mostly randomly from around town. Obviously, you can make any number of plausible key-oid objects that would pass visual inspection as potential keys to the target, but even a godly-intelligent post-Bostromian superAI couldn't solve the problem in any principled way. You just need information about the actual lock you want to open, plain and simple.
There are many, many, many other open problems in biology, of course, ranging all along the conceptual gamut, so I can't say there's nothing where AI would be a killer app. And there's also the sort of last-ditch null argument that, well, being smart helps humans do biology, so computers that are smart must be useful somehow. But typical progress in this field is from patiently acquiring new data. There's no point where you know "enough" of the picture to infer the rest with any degree of certainty- there are simply too many degrees of freedom in the "design" of biological systems.
I’m not sure this is a great analogy. If we didn’t know anything about the locks (target binding sites) I would agree with you, but there is no reason to enforce that restriction. A pipeline that given a target binding could predict a number of potential drug structures with high accuracy would be extremely useful. Going with your analogy, this would be like training a net using a database of lock structures and their corresponding keys, and then having it predict a key for a given lock. That seems fairly doable for current ML techniques.
Obviously actual drug design is more complicated than this, and you have to consider side effects. But I don’t think it’s as hopeless as your analogy.
> Going with your analogy, this would be like training a net using a database of lock structures and their corresponding keys, and then having it predict a key for a given lock.
If we have that, then the problem is entirely trivial- just look at the lengths of the tumblers, and give the key teeth that match. And so it is, with the corresponding increase in complexity, in biology- given a complete and confident structure of the target, we have (non-AI, reasonably reliable) chemical modelling methods to see whether and how a given compound will bind; iterate molecules until you find one that works and you've found your "key" quite cheaply. So, there's another limitation to the practicality of AI- when it's not irreducibly complex and intrinsically unknown, biology is usually mechanistic and deterministic enough that conventional approaches can reason about it quite effectively without resorting to anything so fancy as a neural network.
Biology is almost always more complicated than a given description, so yes, the real problem is more complicated. As you mention, the key we create must not only fit the target lock, but it must also not fit any other locks that might cause problems. Perhaps AI methods could try to make a working key that is as un-key-like as possible, to make it unlikely to fit off-target locks; potentially useful. But, absent knowledge of the various perverse configurations that locks in the wider world can have, the confidence in the AI method's inferences is very limited- roughly speaking, it's attempting to make a classifier for a population that is mostly unknown. We might hope for a severalfold increase in the number of successful drug candidates, in the best case, but not a fundamental disruption of the pharmaceutical research process.
> given a complete and confident structure of the target, we have (non-AI, reasonably reliable) chemical modelling methods to see whether and how a given compound will bind; iterate molecules until you find one that works and you've found your "key" quite cheaply.
I am a researcher in the computational chemistry/biophysics field, and from what I can tell this is not the case (would love to be corrected if I am mistaken; my research is not directly related to drug design). There are various computational ways to test whether a molecule will bind to a target site, but they are either too inaccurate or too computationally expensive to claim that this problem is "solved" by chemical modeling methods. Pharma companies are still wasting money sending drug structures that have been validated by computational methods to organic chemists to synthesize, only for them to end up not binding as expected.
AI can't help much in drug discovery (i.e. lead generation) because the pharmaceutical industry is suffering from Eroom's law ( https://en.wikipedia.org/wiki/Eroom%27s_law ). It's a play off Moore's law (spelled backwards), but the fundamental problem is the pharma/medical industries do not understand biology from first principles, unlike hard physical sciences, such as computer hardware engineering.
First principles understanding of computing hardware allows chip manufacturers to create novel, new chips from scratch, as we fully understand the physics behind electricity, how logic gates work and signals propagate through physical mediums. Lock most college students in a room with resistors, diodes, wire, etc, and reference material, and they could recreate basic circuitry and very simple computers. You can not do the same with biology - no student or expert, given unlimited resources, equipment, or reference material can construct a new living cell from scratch (proteins / molecules). This is not a slander of those sciences and scientists, but there is simply a huge gap in how well we understand the basic principles of life and how well we understand the basic principles of computing.
Throwing more resources / computing into "drug discovery" is like trying to build chips by wiring computer parts differently. Occasionally it might work and produce a "useful" result, but it's fundamentally a broken approach.
It's fine to make approximations to avoid exponential scaling, but applying function approximators essentially randomly won't get you anywhere. This is then compounded by the fact that the functional framework you're starting from is not a first-principles approach.
Until there is QMC for drug discovery, it will all be hype.
Mostly hype. Yes, automating drug discovery to any extent is utterly hopeless, and as likely to impact the pharma business any time as autonomous killer robots overrunning the battlefield -- Not In My Lifetime.
But AI definitely has a near term future in addressing well formed questions like specific assays or searching for well-constrained targets, like ligand matches. The trick is for the AI contributor TO LEARN SOMETHING ABOUT THE DAMNED DOMAIN. Unless the chemist/biologist is intimately involved in the task, the AI provider is shooting blind. But with many wise eyes on the ball, even the hardest problems becomes a lot more assailable.
[I say this as someone who processes images and analyzes data within a big pharma, and has seen several grand IT plans fail (like systems biology disease modeling) and many small & specific scientist-assistance tasks succeed.]
>You can not do the same with biology - no student or expert, given unlimited resources, equipment, or reference material can construct a new living cell from scratch (proteins / molecules).
The last paragraph of that article proves the OP's point, you can't do it by working from first principles.
"Even Venter acknowledges that syn3.0’s genome, although new, was designed by trial and error, rather than being based on a fundamental understanding of how to build a functioning genome."
This is why I was never good at biology and hated both classes I took in it. Coming from a CS background, it was so frustrating not to be able to reason from first principles about the nature of our observations.
Responding to his issues with Vijay Pande's work (I'm not affiliated), graph convolutions really are better than char-RNN for this. There's a good theoretical motivation for why, and people have spent a lot of time trying to find better embeddings of chemical space. (Admittedly, still not great - I wouldn't dispute the overall thesis that AI in drug discovery is still very early.)
I did a project a year or so ago to reimplement one of Aspuru-Guzik's papers, a variational autoencoder, (https://github.com/maxhodak/keras-molecules) and when I did that I compared to char-RNN and the VAE did get much more interesting results. I also saw results from other people around that time showing that using graph convolutions on the front end instead of one-hot encoded SMILES strings was even better.
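For anyone unfamiliar with the baseline representation, "one-hot encoded SMILES" means roughly the following. This is a minimal character-level sketch, not the encoding code from keras-molecules; the function names, the padding column, and the two example molecules are my own choices for illustration.

```python
def build_vocab(smiles_list):
    """Map every character seen in the SMILES strings to a column index."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i for i, ch in enumerate(chars)}

def one_hot_smiles(smiles, vocab, max_len):
    """Encode a SMILES string as a max_len x (len(vocab) + 1) 0/1 matrix.

    The extra last column marks padding. Strings longer than max_len are
    truncated, and unseen characters raise KeyError (both simplifications).
    """
    pad_index = len(vocab)
    matrix = [[0] * (len(vocab) + 1) for _ in range(max_len)]
    for pos in range(max_len):
        index = vocab[smiles[pos]] if pos < len(smiles) else pad_index
        matrix[pos][index] = 1
    return matrix

smiles = ["CCO", "c1ccccc1"]  # ethanol and benzene
vocab = build_vocab(smiles)
encoded = one_hot_smiles("CCO", vocab, max_len=8)
for row in encoded:
    print(row)
```

Note what the encoding throws away: the model sees a string, so ring closures and branches are only implied by digit and parenthesis characters, and a two-letter element like Cl is split into two tokens here. That loss of explicit structure is the usual argument for graph-based front ends.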
Also, as for OP's objection about GANs not working because it's a "perfect discriminator," this is an obvious result that becomes apparent after spending 20 seconds with the problem. (Eg, this thread here: https://github.com/maxhodak/keras-molecules/issues/55) I haven't read the Harvard paper referenced but I'd be absolutely shocked if this was lost on them. There are definitely ways to work through it.
I am not sure I fully understand your remark, but if you have a benchmark of graph convolutions vs. char-CNN, it would be great to write up your results and post them on Arxiv. Pande will be interested ;) The problem is not with the theoretical motivation, but with the empirical confirmation.
I also agree that there are many ways to work through the perfect discriminator problem for ORGAN. But it remains to be done (afaik).
I misread char-CNN as char-RNN in the linked article. I don’t know of results that compare graph convolutions to char-CNN. If a comparison were to be made, what metrics should it focus on?
For the most part, I agree with the article's premise.
But they were pretty unfair to the Stanford lab (Vijay Pande). The Stanford team is using graph convolutions to process molecular structures. The article complains that the Stanford team is biased because they are not using sequences to represent the molecular structures. But this actually makes a lot of sense because molecular structures are graphs. The article then goes on to say that the Stanford team has an agenda to get everyone to use deepchem instead of TensorFlow or PyTorch. But TensorFlow is a dependency in deepchem, and deepchem looks like a bunch of useful tools wrapped around TensorFlow.
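To make "molecular structures are graphs" concrete, here is a rough sketch using ethanol. The helper names and the choice of atomic number as the only node feature are assumptions for illustration; this is not DeepChem's API, and a real graph convolution layer adds learned weights and nonlinearities on top of this kind of neighbor aggregation.

```python
# Minimal sketch: ethanol as a graph of heavy atoms (hydrogens omitted).
# adjacency / aggregate are hypothetical helpers, not library functions.

atoms = ["C", "C", "O"]      # node labels; the indices are arbitrary
bonds = [(0, 1), (1, 2)]     # undirected edges between atom indices

ATOMIC_NUMBER = {"C": 6, "O": 8}

def adjacency(n_atoms, bonds):
    """Build a neighbor list for an undirected molecular graph."""
    neighbors = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors

def aggregate(atoms, neighbors):
    """One unweighted step: each atom's feature plus the sum of its neighbors'."""
    feats = [ATOMIC_NUMBER[a] for a in atoms]
    return [feats[i] + sum(feats[j] for j in neighbors[i])
            for i in range(len(atoms))]

neighbors = adjacency(len(atoms), bonds)
updated = aggregate(atoms, neighbors)
```

Relabeling the atoms (writing the molecule as O, C, C instead) permutes the updated features but does not change their values, whereas the corresponding SMILES strings "CCO" and "OCC" look like different sequences to a character-level model.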
Agreed. Just because you don't compare your model to another model doesn't mean you're hiding something and it's irresponsible to pretend that's the case. Perhaps the author should have run char-cnn against data instead of making baseless accusations.
On the whole, a lot of the arguments are true of ML in general. There is far too much focus on individual problems and the accuracy measure without enough focus on the interpretation of the accuracy, how well it will really generalize, etc.
In general, you are right. But in this particular case, char-CNN is the standard used by many people, Stanford included. It is not sophisticated at all.
For me, this omission is negligence. It is selective laziness.
The author (me) does not have 20 PhD students, postdocs and startuppers at his disposal to do the job for Stanford. With my limited resources, it is more cost-effective to do my job against Stanford ;)
This professor gave an argument in his paper. I gave a refutation. If you agree with him that char-CNN was a sophisticated model in early 2017, then you are not well-informed about the situation in deep learning.
The MoleculeNet co-author gave another argument in the comments. My refutation is that you can't claim to lack time after 8*20 man-months have passed.
1. I agree that graph convolutions make a lot of sense. However, it is not shown yet that they are better than SMILES, although it might just be a matter of time. A lot of graph variants are possible, or maybe we should look at molecules not as graphs, but as quantum objects (See the other Stanford paper on atom convolutions).
2. Yes, Deepchem is built on top of Tensorflow. This additional layer looks useful, but is it really the case? Using standard NLP models is simpler than using deepchem models, and NLP models might still be the state-of-the-art for chemistry tasks, via SMILES (until we get a reasonably comprehensive benchmark, which MoleculeNet is not).
I am not buying into a chemistry-specific library, until it is shown to be really necessary.
DeepChem developer and MoleculeNet co-author here.
We are NOT trying to get vendor lock in for users of DeepChem. If there are complaints about lock in using our tools please file a github issue and we can work on trying to create an open easy to use API for everyone.
From a user viewpoint, Deepchem would greatly benefit from being a better team player with lower-level (Tensorflow) or other (Pytorch) frameworks. The pace of research (in NLP in particular) is too fast to make it realistic to port everything in Deepchem without an unreasonable delay.
Does it fit the Deepchem agenda? That's another question ;)
Thanks for the feedback (I love feedback). While that is a good high level goal the devil is in the details of how to make it happen. Here are some doable ideas in the medium term which might help.
1) Tutorial and documentation on how to access the raw Tensorflow graph when using DeepChem
2) Tutorial of combining raw Tensorflow with our existing chemistry specific layers
3) Different documentation quick-start sections for ML practitioners and application practitioners.
4) Better overall documentation of our Chemistry Specific layers.
Would these ideas have made a better first user experience for you?
This comment is related to AI in healthcare services, not drug discovery specifically, but is illustrative. I recently went to a talk by a sr exec at one of the biggest health systems in the country. They were evaluating AI tools to predict which patients would die in 18 months so they could design the most patient friendly and cost effective end of life plan. None of the algorithms they evaluated outperformed simply asking a doctor which patients she thought would die in the next year.
Of course, "these days" (since ~2014 by my reckoning) AI is synonymous with Deep Learning (in the lay press anyway) but there are actually other techniques still being researched, specifically theory- and knowledge-driven techniques, from the field of symbolic machine learning.
Such techniques have long been used in medicine and biology with great success, very notably in the example of the Robot Scientist, a system that automates the scientific process end-to-end, in a biology context.
This is from the abstract of the Nature paper, from January 2004 [1]:
The question of whether it is possible to automate the scientific process is of both great theoretical interest and increasing practical importance because, in many scientific areas, data are being generated much faster than they can be effectively analysed. We describe a physically implemented robotic system that applies techniques from artificial intelligence to carry out cycles of scientific experimentation. The system automatically originates hypotheses to explain observations, devises experiments to test these hypotheses, physically runs the experiments using a laboratory robot, interprets the results to falsify hypotheses inconsistent with the data, and then repeats the cycle. Here we apply the system to the determination of gene function using deletion mutants of yeast (Saccharomyces cerevisiae) and auxotrophic growth experiments. We built and tested a detailed logical model (involving genes, proteins and metabolites) of the aromatic amino acid synthesis pathway. In biological experiments that automatically reconstruct parts of this model, we show that an intelligent experiment selection strategy is competitive with human performance and significantly outperforms, with a cost decrease of 3-fold and 100-fold (respectively), both cheapest and random-experiment selection.
There's more info and especially links in the Wikipedia article [2].
Was jumping on to say exactly that. We are most likely somewhere close to the "Peak of Inflated Expectations" for the current iteration of AI. It's useful, just not a magic bullet to solve all problems.
I'm not qualified to discuss drug discovery, but my eye was caught by the author's statement that char-RNN on SMILES strings might outperform the use of a Coulomb matrix. IMO this is really misguided as a SMILES string does not contain specific geometrical data (bond angles, dihedrals etc) of a compound...
Learning the grammar of the SMILES representation is not the same as learning to use interatomic distances.
SMILES implicitly stores the geometry data. It's technically true it doesn't encode a unique 3D structure (for a rigid molecule), but you could convert the implicit bonds to standard-length ones, embed the molecule in 3D space, and minimize it (that's precisely what SMILES to 3D structure systems do).
I don't know about "any existing technology" but in any aspect of drug discovery with which i'm aware, i'd agree. My job amounts to using statistics and linear programming to winnow drug candidates out of databases of hundreds of millions of small molecules, (sometimes: "chemoinformatics"). And i've lost count of how many times i've had the marketing department insist on calling what i do "AI". (and let me assure you, as far as pushing back against marketing is concerned "all resistance is futile")
I don't know how it can help train whatever 'CNN' stands for (i'm guessing not the news network, but something Neural Network?) Yet there are vast sources of various ("pictures") representations of small molecules all over the internet; from line-drawings, 2d stick models[1] to those providing rotation to space filling models[2]. As a very personal aside, neural networks are just a means of doing curve fitting without necessarily learning anything.
It stands for convolutional neural network, which is a NN architecture that works well on image data. CNNs are used in object detection and other computer vision related tasks.
Anyone who has actually built software using what the masses think is 'artificial intelligence' knows exactly how artificial and overhyped ML and AI are. Neural networks are a far cry from the magic they've been made out to be, and are nowhere close to as sophisticated as the brains of most animals. Cherry picking results makes for good marketing material, but that's pretty much where the usefulness ends.
We still don't even understand how exactly brains work.
Neural networks really are pretty magical for a lot of image based problems these days. The impact for text problems is so far decidedly not magical, though there are starting to be some tantalizing results in my opinion.
I'd say this is more of a result of decades of trial and error in different architectures/topologies rather than some fundamental property of neural networks in general.
What about AlphaGo Zero's approach of reinforcement learning based on self-play? Arguably the game of Go has really simple, explicit rules, and evaluating the final end state is also easy, but it shows that with enough compute NNs can identify important features without supervision or labeling.
Given enough time, energy, and computing power, you could just precompute every possible outcome and call it a neural network. You'll be praised as a genius.
Ah yes, I forgot. DeepMind just secretly computed all 10^100 board states for AlphaGo...
Neural nets have been very successful for a lot of things while also being overhyped, but dismissing it all as marketing hype is throwing the baby out with the bathwater.
This is like saying "given enough time, energy, and computing power, you could just factor large semiprimes by exhaustive search." It's both true and uninteresting because nobody has or ever will have that much time, energy, and computing power.
Where "enough" is "more than is available in the universe". The number of outcomes you'd need to precompute exceeds the number of atoms in the universe.
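A back-of-the-envelope check of that claim, as a sketch: the bound below simply treats every point on a 19x19 board as empty, black or white, so it overcounts by including illegal positions, and the 10^80 figure for atoms in the observable universe is a commonly quoted order-of-magnitude estimate that is assumed here rather than derived.

```python
# Rough arithmetic only: a crude upper bound on Go board configurations
# compared with an assumed estimate of atoms in the observable universe.

BOARD_POINTS = 19 * 19                 # 361 intersections
STATES_PER_POINT = 3                   # empty, black, white

# Overcounts: includes positions that are illegal under the rules of Go.
upper_bound = STATES_PER_POINT ** BOARD_POINTS   # about 1.7e172

ATOMS_IN_UNIVERSE = 10 ** 80           # assumed order-of-magnitude estimate

# How many decimal digits the ratio between the two has.
excess_digits = len(str(upper_bound // ATOMS_IN_UNIVERSE))

print(f"configurations (upper bound): ~10^{len(str(upper_bound)) - 1}")
print(f"configurations per atom:      ~10^{excess_digits - 1}")
```

Even if the bound is off by several orders of magnitude, storing one precomputed outcome per atom would still fall short by an enormous factor.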
In certain cases, perhaps. But that's the exception, not the rule. You can't simply apply a neural net to every problem and magically get great results.
[1]: http://growthevidence.com/joel-kuiper/