Word of advice to all those who are chomping at the bit to disrupt pharma with AI.

A pharmaceutical company that heavily leverages computation is called a pharmaceutical company. All modern pharmaceutical companies heavily leverage computational tools - including some powered by deep learning.

Any company building a computational platform to accelerate drug discovery / development is not a pharmaceutical company. It is a software company that wants to sell software to pharmaceutical companies, which is a terrible business to be in.

The product the author is selling already exists for free. All of the big knowledge bases use automated ML tooling for curation and extraction. And the people who run them and QC them are world class experts in their domains. I mean, just take a look at uniprot.[0]

And the only types of pharma companies who would want a big knowledge graph would be the large ones with active programs across multiple therapeutic areas. Most of the small companies / start ups tend to be focused on getting one or two assets to market in a therapeutic domain. And the big companies possibly have their own ML teams doing literature extraction - because a team of 5-10 FTE ML / software engineers is like a rounding error in their R&D budget.

The other thing is that the most valuable knowledge is the stuff that is not in the literature. That’s why we do experiments.

[0] https://academic.oup.com/nar/article/51/D1/D523/6835362




Ingenuity's Pathway Analysis product has been the standard tool big pharma has used for 15+ years. Founded by folks from Stanford, they created their own knowledge-graph database a decade before such databases were popular, and have been paying PhDs to curate that database with genetic findings from papers over the last decades. Obviously their analysis also includes the publicly-available databases, and they have their own private genomic databases and expertise from helping many of the first-available genomic services go live.

Still, it's a tough business, and they were bought by Qiagen (to make an end-to-end solution, with moderate success, as the market is strangled by Illumina's control over NGS). And with exponential growth in information, their early relative advantage has likely waned.

Also note Veeva software, which provides infrastructure for pharma, is a public benefit company, legally devoted to its clients. That's how much power the customer has.

That said, there's something of a revolution in drug discovery as more is being done by small companies with ultra-focused expertise. If they get a viable candidate, a bigger company buys it to marshal through approvals, and then it can be sold again to a marketing company. Focusing on these pop-up drug-discovery companies that evolve out of someone's PhD could be a good niche; I would expect pharma VCs to prefer funding companies using their favorite computing vendor, for reliability if not insight. Here computational discovery assistance would help a bit, but validating findings and science could be golden.


I think the big case study everybody gives is Schrödinger who took 20 years to IPO.

The problem is that the market is just not that big. Assume that every single pharma company buys your product - like Schrödinger - where does your revenue top out?

You can beat that as a small pharma company with one or two assets that make it to market. So the question is: if the computational platform is so good then why not just be a pharma company?


For a good computational platform that can pinpoint unique and valuable candidates, isn't there an option to license/consult in exchange for royalties?


No. Makes more sense to license the asset (molecule). And even then, big pharma companies won’t talk to you until your molecule is in phase 2.

If anybody with an AI pharma start up tells you they plan to make money by licensing their platform, then run the other way because it is a red flag.

If you want to see what a real “AI” pharma company looks like check out Vertex pharmaceuticals.


“If you’re so smart, then why aren’t you rich?”


I wouldn't be so quick to dismiss a niche for third party data analysis.

A very common feature across verticals is that in house data analysis is siloed.

While it's a difficult sell, getting multiple siloed data sources to agree to 3rd party analysis for shared gains can be extremely successful for everyone involved when pulled off.

So looking specifically at pharma, while analysis of published papers is meh, an often discussed component of research is the bias against publishing negative results.

So a hypothetical product like using ML across multiple firms' data of failed products and research in order to establish a model that could more quickly identify dud research avenues by leveraging industry data could only exist with the broadest dataset as a 3rd party product and would deliver gains that could be quite profitable for some of the largest companies out there.

Also, I think anyone who has worked with larger corporations on in house tech knows that even if the individual efforts are quite large and sophisticated, the sheer amount of bureaucracy that goes into every single thing at a sizable firm can mean significant advantages for startups vs in-house efforts, particularly when related to fast moving fields.

I'd agree that "moving into a niche without knowing it extremely well" can be fraught with issues and that attempting to get buy in from B2B firms for a startup is a nightmare, but I'd disagree with "large company does X in house so creating a startup to do X is a bad idea."


I mean what you suggested already kind of exists in several forms.[0,1]

Third-party data analysis is a thing, but this is often bundled with conducting experiments by CROs.

See my other comment about market size. The problem is that you can make A LOT more money selling drugs than you can selling software to pharma companies. So if your software is any good, then you should use it to make drugs and just be a pharma company.

0. https://www.opentargets.org/

1. https://www.citeline.com/en/products-services/clinical/pharm...


What I suggested was simply an illustrative example off the top of my head. The fact that there are multiple similar companies that already independently exist furthers my point.

As for your other point, I'd imagine there's a pretty big difference in the capabilities and infrastructure needed for bringing a drug to market versus analyzing the data of people who bring drugs to market.


Thanks for commenting. I've been thinking a bit about taking some computational insights we have developed over years to bioinformatics.

During my Ph.D., my PI wanted to do some work on signalling pathway analysis. He was interested in knowing when and why the body sometimes starts assuming that the sick state is the right state and starts acting against the drugs given to patients, which makes treatment very hard.

I don't know what the status of that is, but one insight I got from him is that most (all?) pharma companies do not trust academic data. I didn't ask for reasons, but I'd assume it is because of the replication issue (crisis). Knowing how data is gathered and published in the signalling domain, I wouldn't blame them for having low trust in academic data.


Depends on what you want to use the data for. In general, you should take a trust but verify approach to new results, methods, data, etc.


I did a stint at a large pharma company, and it was not free software that was used to extract structured data from publications.


> I mean, just take a look at uniprot.[0]

this looks like one narrow niche knowledge base. I am not an expert in this domain, but there are probably some other use cases not covered by existing offerings.

> literature extraction - because a team of 5-10 FTE ML / software engineers

I think this problem is so hard and open ended, that your 5-10 non-star avg salary ml ftes likely produce very mediocre and likely not usable results.


Uniprot fills a pretty big niche - proteins. Your body is literally made out of proteins. They carry out most of the work and chemical reactions that amount to what we call life, and almost all drugs work by modulating some protein target in some way. UniprotIDs are the canonical identifiers that computational biologists use for proteins.

But they are just one knowledge base. There are others. Each with their own focus area. And they are all associated with prominent bioNLP / biomedical AI research labs and employ human SME curators.

> 5-10 avg salary non-star ML FTEs likely produce very mediocre and likely non usable results

LOL!


> But they are just one knowledge base. There are others.

So, is your claim that all topics/niches/verticals/steps in pharma development and production are covered by all those databases with absolute quality and perfect UI/workflow? I follow some studies about trial meta-analysis, and my impression is that the end result is mainly produced by the manual work of some low-paid postdocs, which makes it not trustworthy.

> LOL!

I actually claim non-trivial expertise in this area (fact extraction from nontrivial niche documents), and my observation is that even SOTA (aka results from top researchers) is hardly usable in real life, because this area is very hard. Could you support your "LOL!" with any references I could check?


I already gave you a real life example with the uniprot reference. Here is another flagship knowledge base that heavily leverages NLP extraction.[0] Here is another one that gets used in what seems like every network biology article.[1]

Meta analyses? Automating meta analyses is not a real need. They have their place, but it’s like a quaint cottage industry type thing - like custom haberdashery.

Also the most valuable knowledge is not in any publication. If you are reading about it in an article then you are already 2-3 years too late.

0. https://geneontology.org

1. https://string-db.org/


> that heavily leverages NLP extraction

I searched https://www.google.com/search?q=site%3Ageneontology.org+nlp+... and didn't see anything meaningful.

> Automating meta analyses is not a real need. They have their place, but it’s like a quaint cottage industry type thing - like custom haberdashery.

my opinion is that it has to be a top-level tool in modern science, one which could easily sort out lots of bs and contradictions in reported results.


The parent is right, saying a database about proteins 'looks like one narrow niche knowledge base.' in this context is just nonsense.


In which context, and why should your bold statements be trusted blindly?


TL;DR: I explain what proteins are and why they're important, show some biological "flowcharts", and end up with one "function definition" from the "source code" that makes you you, and has to do with you eating and breathing.

Long:

This is the bit that makes biology awesome to me, so excuse me for the small essay ;-)

Proteins are basically extremely advanced nanomachines which work together in larger systems to ultimately form a cell. Having a listing of all the proteins in a cell (the sum of the parts) is insufficient to grok the whole, but it's pretty darn important. The abilities and limitations determine and constrain what a cell can do, and ultimately influence what organisms and ecosystems are and are not capable of. Which is a big chunk of the science of biology.

Uniprot lists many/all of the genes/proteins that have been decoded so far. It's a bit odd to call that a "niche" in the context of the field of biology.

I'm not going to discourage you though: Proteins and protein systems are pretty darn awesome!

Example:

(using KEGG rather than uniprot, since it's got graphical maps, which is handy to get an intuition)

For instance, if you want to know why you need to eat and why you need to breathe (flowchart for what your cells do with starch and oxygen):

* https://www.genome.jp/pathway/map00500 start by finding starch on this map (gets split into glucose)

* https://www.genome.jp/pathway/map00010 which gets broken down into 2* pyruvate

* https://www.genome.jp/pathway/map00020 which gets processed

* https://www.genome.jp/pathway/map00190 and ultimately "burned" with oxygen.

Each step is a 'chemical reaction catalyzed by proteins'[1] (in the rectangles). You can dig in deeper to find your actual source code: Say we click on a random step (in this case near the top of glycolysis on map 00010)

* https://www.genome.jp/entry/K01810+K06859+K13810+K15916+5.3....

At the bottom you can find the gene listed for Homo Sapiens (HSA)

* https://www.genome.jp/entry/hsa:2821

And this lists the amino-acid (AA) sequence for the protein, and the nucleotide (NT) sequence found in humans. Since this is highly conserved functionality, that's probably (almost) exactly the source code that you have in each of your cells.

KEGG is nice to get an overview of some of the pathways that are fully understood with the maps.

[1] calling it a "chemical reaction" is sort of underselling many proteins. Proteins can have moving parts and can work together. I prefer to think of them as sophisticated nanomachines.


> Uniprot lists many/all of the genes/proteins that have been decoded so far. It's a bit odd to call that a "niche" in the context of the field of biology.

I actually checked uniprot. Yes, it lists proteins (probably most of them), but the ontology is rather narrow: it has a few dozen properties, and you can't for example query that DB with question: give me diseases which can be attributed to broken pathways synthesizing protein X, you would need to do a lot of manual work and check external databases of uncertain quality.

Another question is the quality of that dataset: why is it so obvious that all those millions of pathways for hundreds of thousands of proteins are researched and described with 100% accuracy?
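
To make the kind of query I mean concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the identifiers (`PROT_X`, `pathway_1`, `disease_A`) are placeholders, not real records, and the two lookup tables stand in for separate databases that would have to be joined by hand today.

```python
from collections import defaultdict

# Hypothetical toy records; a real pipeline would load these from
# separate protein, pathway, and disease databases.
PROTEIN_TO_PATHWAYS = {
    "PROT_X": {"pathway_1", "pathway_2"},
    "PROT_Y": {"pathway_2"},
}
PATHWAY_TO_DISEASES = {
    "pathway_1": {"disease_A"},
    "pathway_2": {"disease_A", "disease_B"},
    "pathway_3": {"disease_C"},
}


def diseases_for_protein(protein):
    """Return {disease: sorted pathways that link it to the protein}."""
    evidence = defaultdict(set)
    for pathway in PROTEIN_TO_PATHWAYS.get(protein, ()):
        for disease in PATHWAY_TO_DISEASES.get(pathway, ()):
            evidence[disease].add(pathway)
    return {disease: sorted(paths) for disease, paths in evidence.items()}
```

The join itself is trivial; the hard part is whether the tables behind it can be trusted.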


> you can't for example query that DB with question: give me diseases which can be attributed to broken pathways synthesizing protein X, you would need to do a lot of manual work and check external databases of uncertain quality.

nod

Uniprot is more useful if you're looking for the actual "bare metal" NT and AA sequences. Which is rather important in its own right, obviously: Sooner or later you DO need the actual sequences if you're going to do something with them in real life.

But uniprot doesn't -itself- give you an understanding of what that code is then doing.


> this looks like one narrow niche knowledge base. I am not an expert in this domain, but there are probably some other use cases not covered by existing offerings.

UniProt is not niche. It contains curated information about all proteins across numerous domains.


yes, proteins are a niche in the grand scheme of things


The GP was talking about pharma. Proteins are not niche in pharma. Everything else may be niche in that domain, but proteins make up more than 95% of the targets of the pharmaceutical industry.


> The GP was talking about pharma.

post is not about pharma specifically, but about general bio-medical literature, including genes, diseases, symptoms, etc.

> proteins make up more than 95% of the targets of the pharmaceutical industry.

do you have references to support this? Some google search says it is a 280B market out of a 1.5T total pharma market: https://www.alliedmarketresearch.com/protein-therapeutics-ma... https://www.statista.com/topics/1764/global-pharmaceutical-i...

Also, these are likely important steps in researching and producing protein drugs, but the end goal is to cure diseases, so you need lots of additional data about diseases of all types, symptoms, pathways, trials, etc.


This isn't a "do you have references" situation. You're wasting folks' time. Nearly all drugs target proteins, with a few that target DNA or RNA.


You can just ignore my comments and walk away? To me you are just another "internet expert", and I am not sure why I should blindly trust you.


Your sources are talking about the ratio of small molecule vs. large molecule drugs. Even if you're developing small molecule drugs you are likely targeting some aspect of protein signaling/gene expression.

People are being dismissive of your comments because to say that proteins are niche in the context of pharma is like saying advertising is niche in the context of Meta and Google.


> People are being dismissive of your comments because to say that proteins are niche in the context of pharma is like saying advertising is niche in the context of Meta and Google.

It's all about how you define the word "niche". For Google, the main revenue stream is supported by several pillars: search tech, infra tech, ads tech, ecosystem+network effect, human management. Remove one pillar and everything is destroyed, so one can say ads is one of the niches in their food chain. I suspect with proteins it is about the same.

> in the context of pharma

there is no context of pharma. The post is about bio-medical publications more broadly.


I'd say it's less like pillars and more like (emergence) layers, with proteins being a pretty important layer.

I'm now not entirely sure what your experience is with bio sciences. You're definitely coming at it from an odd angle though!


I didn't claim expertise, that's why I say "it looks like", "I suspect".


Well, maybe find some time and dive in a bit and see what can be found?

You never know, maybe you'll end up contributing to our understanding of life, maybe (indirectly) even save a few lives!


I am working on a service which can potentially answer questions like the one in this comment: https://news.ycombinator.com/item?id=38109294

Life science is one of the potential applications if there is interest and money.


So the pathway to synthesize every protein is +/- the same: That's gene transcription[1] and translation[2]. If that's broken, you're in big trouble!

But if you mean in general if you're capable of looking at metabolic pathways where each protein catalyses a step in the pathway, that's definitely interesting. If a certain person has a flawed gene coding for protein X, that could indeed cause a problem.

To find valid answers, you might need to e.g. track nodes and states in a graph, to figure out all the consequences of a break. Not all types of storage systems/engines are equally good at that.

[1] https://en.wikipedia.org/wiki/Transcription_(biology)

[2] https://en.wikipedia.org/wiki/Translation_(biology)

edit: s/protein pathway/metabolic pathway/
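
A rough sketch of what tracking the consequences of a break could look like, in Python. This is only an illustration under loud assumptions: the chain starch -> glucose -> pyruvate -> citrate cycle is the crude version from the KEGG walkthrough earlier in the thread, while the enzyme labels (`enzyme_1` etc.) and the `side_product` branch are made-up placeholders, not real genes or compounds.

```python
from collections import deque

# (substrate, product, catalyzing protein). Enzyme labels and the
# side_product branch are hypothetical placeholders.
REACTIONS = [
    ("starch", "glucose", "enzyme_1"),
    ("glucose", "pyruvate", "enzyme_2"),
    ("pyruvate", "citrate_cycle", "enzyme_3"),
    ("glucose", "side_product", "enzyme_4"),
]


def reachable(source, broken=frozenset()):
    """Breadth-first search that only follows reactions whose enzyme works."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for substrate, product, enzyme in REACTIONS:
            if substrate == node and enzyme not in broken and product not in seen:
                seen.add(product)
                queue.append(product)
    return seen


def lost_by_break(source, broken_enzyme):
    """Compounds reachable normally but not once one enzyme is broken."""
    return reachable(source) - reachable(source, {broken_enzyme})
```

Here `lost_by_break("starch", "enzyme_2")` reports everything downstream of the broken step. Scanning a flat list of reactions is fine for a toy, but at real scale you would want an index or a graph-oriented store, which is the point about storage engines.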


> Not all types of storage systems/engines are equally good at that.

Yes, I built a system which traverses paths in graphs with 1B nodes and 10B links in 1h on an affordable server. But that's only one part of the puzzle.


Neat!


> Word of advice to all those who are chomping at the bit to disrupt pharma with AI.

Literally the first line in the comment that started this thread.


Sure, now let's read the post?


I read the post before I made any criticism of your comments. We're talking on a thread within the larger context of comments on the post. But more importantly, if you read the post, you will see there is a theme of industrial biochemistry (IE, pharma and biotech) running through it, because pharma/biotech is the primary consumer of these products, and the vast majority of the revenue stream.


> there is a theme of industrial biochemistry

that's one of the themes (and you are already working hard to stretch drugs pharma to "biochemistry"), if you can't see other themes in his examples and screenshots, I think this discussion is not interesting to me.


The first Google search link you've provided is focused on proteins as an active ingredient (like an antibody), not the targets.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6314433/ Check out this article for example.


So, that link says they found 1.5k FDA-approved drugs which target proteins, while the FDA has a total of 19k approved drugs: https://www.fda.gov/media/115824/download#:~:text=FDA%20regu....


Hi there. Apologies for any confusion. I was clarifying the intent of the first comment in this thread. I will add some impressionistic comments below in case it helps your future service.

Even when a drug target is unknown, or remains mysterious, chances are very high that the target is a protein. DNA or RNA as targets are niche (DNA is often avoided as a target on purpose and it’s hard to be specific to it without DNA-like material, and RNA is still hard to target effectively, though things are improving). There is not much else of use in the cells (lipids, sugars, cofactors, and metabolites, some examples of which have been targeted by a couple of drugs each over the long history of trials, often unintentionally).

Small molecules are an excellent modality for an eventual approved therapy. They almost always (so far) target a protein. They are hard to design but when they are done well they expose the target to something nature hasn’t seen before in order to get a desired effect. Sometimes people don’t care about the target itself (think recreational drugs, or phenotypic drug discovery), but the target typically remains a protein.

Proteins make up the machinery of the cell. You jam or modify them to achieve desired effects.


Also, and totally minor: there are nowhere close to 19k different approved small molecules. The same drug can be included in multiple products or formulations, which brings the number you mentioned to 19k marketed products. Each generic formulation of ibuprofen increases the latter count by 1. Counting all the marketed products of pure orange juice may add up to a large number but it is still one ingredient.
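
In code terms the two counts differ like this. A minimal sketch in Python; the product records are invented placeholders (only ibuprofen comes from the point above), so the numbers mean nothing beyond showing the distinction.

```python
# Hypothetical product records: (marketed product, active ingredient).
PRODUCTS = [
    ("brand_ibuprofen_200mg", "ibuprofen"),
    ("generic_ibuprofen_200mg", "ibuprofen"),
    ("generic_ibuprofen_400mg", "ibuprofen"),
    ("brand_other_tablet", "other_molecule"),
]


def count_products_and_ingredients(products):
    """Return (number of marketed products, number of distinct ingredients)."""
    ingredients = {ingredient for _, ingredient in products}
    return len(products), len(ingredients)
```

Every new formulation adds a row to the product list, but the set of ingredients only grows when a new molecule shows up.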


Yes, I compare drugs with drugs, not molecules with drugs.


No. The analysis in the paper above referred to molecules, while the snippet from the FDA referred to products. More generally, other than marketing materials to doctors or patients or very rare exceptions talking about formulations etc, the scientific literature refers to the molecular entity as a drug, not the particular named/branded product. I hope that the tool you are building will not mix up such concepts.


The link you posted found "1,578 US FDA-approved drugs" (exact citation).


Yes that is the standard language in the field. It does not refer to the number of marketed products. Look up the number of novel drugs approved by the FDA per year. It only recently exceeded 40 per year, and the FDA has not existed for very long.


if we assume that you are not making things up (which I am not at all confident about), then your link is useless in this discussion, because it doesn't give a sense of how many non-protein-targeting drugs were approved by the FDA.


and an increasing number of drugs - in the "old days" it was almost entirely small molecules, but they are starting to peter out (both because the low-hanging fruit has already been plucked, and also because small molecules are usually a terrible way to modulate biological activity in specific ways).



